# Preparing adata for cellxgene / MaMPlan creation

### Preparing for cellxgene
This notebook prepares the anndata object for cellxgene.
The preparation includes:
 - Removing unnecessary data to keep the resulting h5ad file as small as possible
 - Renaming columns for a nicer presentation in cellxgene
 - Converting unsupported datatypes to supported datatypes
 - Additional fixes for bugs between scanpy, anndata and cellxgene
   
### MaMPlan creation
Additionally, a MaMPlan can be created which is needed to deploy the dataset to the BCU repository using mampok or the BCU repository overlay.  
A MaMPlan acts as the config file for each specific dataset. It holds a variety of different parameters needed by mampok and the BCU repository.  
To simplify the creation process, only the most important parameters can be set here. The remaining parameters receive default values, which are often required.

See the [MaMpok wiki](https://gitlab.gwdg.de/loosolab/software/mampok/-/wikis/Getting-Started/MaMPlan_keys) for more detailed information about each parameter.

#### Parameters
| Parameter | Description | Options |
|:---:|:---|:---|
| project_id | Project ID, e.g. 'ext123', 'dst123' | str |
| tool | Select the cellxgene docker container. | 'cellxgene-new', 'cellxgene-fix', 'cellxgene-vip-latest'  |
| cluster | Select the kubernetes cluster. | 'BN', 'GI', 'GWDG', 'GWDGmanagt' |
| organization | Select organizations related to the project.<br> Every user in one of the organizations will be able to access the dataset via the BCU repository. | [Options](https://gitlab.gwdg.de/loosolab/software/metadata_whitelists/-/blob/main/whitelists/department?ref_type=heads) |
| label | Set label shown in the browser tab. | str |
| user | List of users who, in addition to the organization, get access to the dataset via the BCU repository. | List of LDAP user IDs |
| owner | Owner / responsible person of the dataset. Set to public for public datasets. | LDAP user ID or public |
| pubmedid | PubMed ID of public datasets. | PubMed ID |
| citation | Citation of public dataset. | str |
| cpu_limit | Set the maximum number of CPU cores the deployment can use. | int |
| mem_limit | Set the maximum amount of memory (in GB) the deployment can use. | int |
| cpu_request | Set the number of CPU cores requested for the deployment. | int |
| mem_request | Set the amount of memory (in GB) requested for the deployment. | int |
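
Before filling in the input cell, it can help to check the chosen values against the options in the table. The following is a minimal sketch, not part of mampok or sctoolbox: the helper `check_mamplan_options` is hypothetical, the allowed values are copied from the table above, and the rule that a request should not exceed its limit is an assumption based on how Kubernetes generally treats resource requests and limits.

```python
# Allowed values copied from the parameter table above.
ALLOWED_TOOLS = {"cellxgene-new", "cellxgene-fix", "cellxgene-vip-latest"}
ALLOWED_CLUSTERS = {"BN", "GI", "GWDG", "GWDGmanagt"}


def check_mamplan_options(tool, cluster, cpu_request=None, cpu_limit=None,
                          mem_request=None, mem_limit=None):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if tool not in ALLOWED_TOOLS:
        problems.append(f"Unknown tool: {tool!r}")
    if cluster not in ALLOWED_CLUSTERS:
        problems.append(f"Unknown cluster: {cluster!r}")
    # Assumption: a request larger than its limit is a configuration mistake.
    for name, request, limit in (("cpu", cpu_request, cpu_limit),
                                 ("mem", mem_request, mem_limit)):
        if request is not None and limit is not None and request > limit:
            problems.append(f"{name}_request ({request}) exceeds {name}_limit ({limit})")
    return problems


print(check_mamplan_options("cellxgene-fix", "BN"))
```

An empty list means no problem was detected by these checks; it does not guarantee that mampok accepts the configuration.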

## Setup

In [None]:
import sctoolbox.utilities as utils
from sctoolbox import settings

## Input - 1

In [None]:
# sctoolbox settings
settings.adata_input_dir = "pipeline_output/adatas/"
settings.adata_output_dir = "./"
settings.log_file = "prepare_for_cellxgene_log.txt"
last_notebook_adata = "anndata_4.h5ad"
datatype = "scRNA"

# MaMPlan options

## Project options
project_id = "Test-ID"
tool = "cellxgene-fix"  # or "cellxgene-vip-latest"
cluster = "BN"
organization = ["AG-nerds"]
label = None
user = None
owner = "Test-owner"

## Options for public datasets
pubmedid = None
citation = None

## Options for computational resource management

### Limit
cpu_limit = None
mem_limit = None
### Requested
cpu_request = None
mem_request = None

mamplan_filename = f"{project_id}_MaMPlan.yaml"

## Load data

In [None]:
adata = utils.load_h5ad(last_notebook_adata)
display(adata)

## Prepare adata for cellxgene

### Input - 2 

In [None]:
# Keep columns in adata.obs (Cell metadata)
keep_obs = [
    "sample",
    "batch",
    "celltype",
    "pct_counts_is_mito",
    "pct_counts_is_ribo",
    "phase",
    "clustering",
    "SCSA_pred_celltype",
    "marker_pred_celltype"
]

# Rename columns in adata.obs
rename_obs = {
    "sample": "Sample",
    "batch": "Batch",
    "celltype": "Celltype",
    "pct_counts_is_mito": "Mitochondrial content (%)",
    "pct_counts_is_ribo": "Ribosomal content (%)",
    "phase": "Phase",
    "clustering": "Final Clustering",
    "SCSA_pred_celltype": "Predicted Celltype (SCSA)",
    "marker_pred_celltype": "Predicted Celltype (Marker)"
}

# Keep columns in adata.var (Gene metadata)
# An empty list removes all columns
keep_var = []
rename_var = {}
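
The keep and rename lists are only useful if they refer to columns that actually exist. A minimal sketch of a consistency check follows; `find_column_problems` is a hypothetical helper written for illustration and is not provided by sctoolbox.

```python
def find_column_problems(keep, rename, available):
    """Hypothetical helper: compare keep/rename settings with existing columns.

    Returns (missing, not_kept):
      missing  - columns listed in `keep` that are not in `available`
      not_kept - keys of `rename` that are not listed in `keep`
    """
    available = set(available)
    missing = [col for col in keep if col not in available]
    not_kept = [col for col in rename if col not in keep]
    return missing, not_kept


# Toy example with made-up column names.
missing, not_kept = find_column_problems(
    keep=["sample", "batch", "phase"],
    rename={"sample": "Sample", "celltype": "Celltype"},
    available=["sample", "batch", "n_genes"],
)
print(missing)   # columns to add or remove from the keep list
print(not_kept)  # rename entries that would have no effect
```

In this notebook the same check could be run as `find_column_problems(keep_obs, rename_obs, adata.obs.columns)` once the leiden columns have been added.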

### Add leiden columns

In [None]:
leiden_cols = [col for col in adata.obs.columns if col.startswith("leiden")]
keep_obs += leiden_cols
rename_obs |= {c: c.replace("_", " ").capitalize() for c in leiden_cols}
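
To see what the cell above produces, here is the same logic applied to a plain list of made-up column names instead of `adata.obs.columns`.

```python
# Made-up column names standing in for adata.obs.columns.
columns = ["sample", "leiden_0.5", "leiden_1.0", "phase"]

# Collect every column whose name starts with "leiden".
leiden_cols = [col for col in columns if col.startswith("leiden")]

# Build display labels: underscores become spaces, first letter is upper-cased.
labels = {c: c.replace("_", " ").capitalize() for c in leiden_cols}

print(labels)  # {'leiden_0.5': 'Leiden 0.5', 'leiden_1.0': 'Leiden 1.0'}
```

Note that the in-place dictionary merge operator `|=` used in the cell requires Python 3.9 or newer; on older versions `rename_obs.update(...)` does the same.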

The cellxgene preparation removes all data from the anndata object that is not required for the cellxgene deployment.  
This saves memory on the cluster and reduces runtime.

In addition, every invalid or problematic datatype is detected and cast to a suitable datatype where possible.

**Note: Keep in mind that the resulting adata object should not be used for further analysis.**
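
To illustrate the kind of datatype conversion described above, the following sketch casts columns of a small pandas DataFrame. It is not the sctoolbox implementation: the function `cast_problematic_columns` is hypothetical, and the two rules used here (object columns become categoricals, float64 becomes float32) are assumptions chosen for illustration only.

```python
import pandas as pd


def cast_problematic_columns(obs: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of `obs` with illustrative casts."""
    out = obs.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_object_dtype(series):
            # Assumption: free-form object columns are stored as categoricals.
            out[col] = series.astype(str).astype("category")
        elif str(series.dtype) == "float64":
            # Assumption: double precision is reduced to save space.
            out[col] = series.astype("float32")
    return out


# Toy metadata table with made-up values.
obs = pd.DataFrame({
    "celltype": pd.Series(["T cell", "B cell"], dtype=object),
    "score": [0.1, 0.2],
})
print(cast_problematic_columns(obs).dtypes)
```

The helper works on a copy, so the original table keeps its datatypes.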

In [None]:
utils.prepare_for_cellxgene(adata,
                           keep_obs=keep_obs,
                           keep_var=keep_var,
                           rename_obs=rename_obs,
                           rename_var=rename_var,
                           inplace=True)

In [None]:
display(adata)

### Save adata

In [None]:
# Save the prepared adata
adata_output = f"{project_id}_cellxgene.h5ad"
utils.save_h5ad(adata, adata_output)

## Write MaMPlan

In [None]:
try:
    from mampok import mamplan_creator
except ModuleNotFoundError:
    raise ModuleNotFoundError("Please install the latest mampok version.")

In [None]:
mamplan = mamplan_creator.SimpleMamplan(
    exp_id=project_id,
    files=adata_output,
    tool=tool,
    analyst=utils.get_user(),
    datatype=datatype,
    cluster=cluster,
    label=label,
    organization=organization,
    user=user,
    owner=owner,
    pubmedid=pubmedid,
    citation=citation,
    cpu_limit=cpu_limit,
    mem_limit=mem_limit,
    cpu_request=cpu_request,
    mem_request=mem_request
)

In [None]:
mamplan.save(f"{settings.adata_output_dir}/{project_id}")