In [None]:
from sctoolbox.utils.jupyter import bgcolor, _compare_version

nb_name = "annotation.ipynb"

_compare_version(nb_name)

# Cell type annotation and marker list assembly
<hr style="border:2px solid black"> </hr>

## 1 - Description

**Requires a clustered or otherwise categorized anndata object. A clustering can be generated with a clustering notebook (e.g. `rna_analysis/notebooks/04_clustering.ipynb`).**

**Move this notebook into the notebook folder (e.g. `rna_analysis/notebooks/`) of the respective analysis before using it!**

This Jupyter Notebook is designed for annotating cell types in clustered AnnData objects. It is divided into two main parts:

- **Marker List Assembly**: This part is used when no existing marker lists are available. It enables users to assemble custom marker lists using the MarkerRepo.

- **Annotation**: This section applies the created or provided marker lists to annotate cell types in AnnData objects.

The parameters are organized in three tables:
1. The first table contains basic parameters necessary for the annotation process.
2. The second table lists parameters specific to the Marker List Assembly section.
3. The third table lists parameters related to the Annotation section.

For a basic analysis, the parameters in the first table should be sufficient. However, for more advanced fine-tuning and detailed control of the analysis, the parameters in the second and third tables become critical.


## 1.1 - Parameter Overview

### 1.1.1 - Essential input data

| Parameter | Description | Options |
|-----------|-------------|--------------|
| `clustered_adata` | Name of the clustered AnnData file for use. | String |
| `clustering_column` | `.obs` column used for cell type assignment. | `None` (select interactively) or String (e.g., `"leiden"`) |
| `marker_lists` | Paths to marker lists. If `None`, assemble lists using MarkerRepo. | `None` or String or list of Strings (e.g., `"/path/my_markers"` or `["/heart_markers/markers", "/human/panglao"]`) |

A **custom marker list** is a text file (`.csv`, `.tsv`, ...) with two tab-separated columns and no header: the first column is the marker name, the second is the cell type. Example:

```
marker_1    Fibroblast
marker_2    Fibroblast
marker_3    Endocardium
...
```
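
As a minimal sketch, a file in this layout can be written and read back with Python's standard `csv` module. The file name `my_markers.tsv` and the marker names are placeholders for illustration, not files shipped with this notebook.

```python
import csv
import os
import tempfile

# Placeholder markers; replace with real gene names for your tissue.
rows = [
    ("marker_1", "Fibroblast"),
    ("marker_2", "Fibroblast"),
    ("marker_3", "Endocardium"),
]

# Write the list without a header, one tab-separated marker/cell type pair per line.
path = os.path.join(tempfile.mkdtemp(), "my_markers.tsv")
with open(path, "w", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Read the file back and group markers by cell type to check the layout.
markers_by_cell_type = {}
with open(path, newline="") as handle:
    for marker, cell_type in csv.reader(handle, delimiter="\t"):
        markers_by_cell_type.setdefault(cell_type, []).append(marker)

print(markers_by_cell_type)
```

The resulting path could then be passed as `marker_lists`.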


### 1.1.2 - Marker List Assembly

| Parameter | Description | Options |
|-----------|-------------|---------|
| `organism` | Specifies the organism for marker list assembly. | `None` or String (e.g., `"human"`) |
| `column_specific_terms` | Search terms for marker list assembly, targeting specific columns. | `None` or Dictionary (e.g., `{"Source": "panglao.se"}`) |
| `cml_parameters` | Additional parameters for marker list assembly. One marker list is created per dictionary. | `None` or List of dictionaries (e.g., `[{"style":"two_column", "file_name":"two_column"}, {"style":"score", "file_name":"score"}]`) |
| `repo_path` | Path to MarkerRepo. | String |
| `lists_path` | Path to a custom marker lists folder. If `None`, the lists folder of the `repo_path` will be used. | `None` or String (e.g., `"/path/my_markers"`) |
| `style` | The style of the marker lists. Options include "two_column" and "score". | String |
| `file_name` | The name of the exported marker lists. | `None` (enter interactively) or String |

If `column_specific_terms` and `cml_parameters` are `None`, you can assemble marker lists interactively.

The following columns are currently available for the MarkerRepo query: `"ID"`, `"List name"`, `"Date"`, `"Source"`, `"Organism name"`, `"Taxonomy ID"`, `"Submitter name"`, `"Email"`, `"Tags"`, `"Genotype"`, `"Gender"`, `"Life stage"`, `"Tissue"` and more.
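
The sketch below shows how the two parameters might be filled in, together with a small sanity check. The helper `check_cml_parameters` is hypothetical and not part of MarkerRepo; it only checks against the two styles named in the table above, so other styles that MarkerRepo may support would be flagged as unknown.

```python
ALLOWED_STYLES = {"two_column", "score"}  # styles named in the table above


def check_cml_parameters(cml_parameters):
    """Return a list of problems found in a cml_parameters list (hypothetical helper)."""
    problems = []
    for i, params in enumerate(cml_parameters):
        style = params.get("style")
        if style not in ALLOWED_STYLES:
            problems.append(f"entry {i}: unknown style {style!r}")
        if not params.get("file_name"):
            # According to the table, a missing file_name is entered interactively.
            problems.append(f"entry {i}: no file_name given")
    return problems


# Example values taken from the parameter table.
column_specific_terms = {"Source": "panglao.se"}
cml_parameters = [
    {"style": "two_column", "file_name": "two_column"},
    {"style": "score", "file_name": "score"},
]

print(check_cml_parameters(cml_parameters))
```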


### 1.1.3 - Annotation Parameters

| Parameter | Description | Options/Type |
|-----------|-------------|--------------|
| `marker_repo` | Use MarkerRepo for annotation. | Boolean |
| `SCSA` | Use SCSA for annotation. | Boolean |
| `mr_obs` | `.obs` prefix for MarkerRepo annotation. | String (e.g., "mr") |
| `scsa_obs` | `.obs` prefix for SCSA annotation. | String (e.g., "scsa") |
| `rank_genes_column` | Key in `.uns` that holds the rank genes scores. If `None`, the ranking is performed on `clustering_column`. | `None` or String |
| `reference_obs` | A reference annotation in `.obs` for comparison. | `None` or String |

For more information about MarkerRepo, click [here](https://gitlab.gwdg.de/loosolab/software/annotate_by_marker_and_features).
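
When `rank_genes_column` is `None`, the plotting cell later in this notebook falls back to the key `rank_genes_groups_<clustering_column>`. A minimal sketch of that fallback follows; `resolve_rank_genes_key` is an illustrative helper name, not a function of sctoolbox or MarkerRepo.

```python
def resolve_rank_genes_key(rank_genes_column, clustering_column):
    """Return the given key, or the clustering-based default if none was set."""
    if rank_genes_column:
        return rank_genes_column
    return f"rank_genes_groups_{clustering_column}"


print(resolve_rank_genes_key(None, "leiden"))
print(resolve_rank_genes_key("my_ranking", "leiden"))
```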

--------------

## 2 - Setup

In [None]:
from sctoolbox import settings
import sctoolbox.utils as utils
import sctoolbox.plotting as pl
import pandas as pd
pd.set_option('display.max_columns', None)  # no limit to the number of columns shown

In [None]:
try:
    import markerrepo.wrappers as wrap
    import markerrepo.marker_repo as mr
except ModuleNotFoundError:
    raise ModuleNotFoundError("Please install the latest MarkerRepo version.")

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
%bgcolor PowderBlue

# sctoolbox settings
settings.adata_input_dir = "../adatas/"
settings.adata_output_dir = "../adatas/"
settings.figure_dir = "../figures/annotation/"
settings.table_dir = "../tables/annotation/"
settings.log_file = "../logs/annotation_log.txt"

clustered_adata = "anndata_4.h5ad"

___

## 3 - Loading adata

In [None]:
adata = utils.adata.load_h5ad(clustered_adata)

In [None]:
with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    display(adata)
    display(adata.obs)
    display(adata.var)

___

## 4 - Essential Input

### Available organisms
* Organisms available for marker list assembly if you do not provide a custom list (or lists):
```
'human', 'mouse', 'zebrafish', 'rat'
```

* If you provide at least one custom marker list in `marker_lists`, the parameter `organism` is not used and the **Assemble marker lists** step is skipped.
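
The rule above can be sketched as a small function. `assembly_step` is an illustrative helper, not part of the notebook or of MarkerRepo; it mirrors the checks made at the start of the **Assemble marker lists** section.

```python
def assembly_step(marker_lists, organism):
    """Decide whether marker lists are assembled or custom lists are used."""
    if marker_lists:
        # Custom lists take precedence; organism is ignored.
        return "skip assembly, use custom lists"
    if organism:
        return f"assemble lists for {organism}"
    raise ValueError("Provide either organism or marker_lists")


print(assembly_step(None, "human"))
print(assembly_step(["/path/my_markers"], "human"))
```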

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
%bgcolor PowderBlue

# Annotation settings
clustering_column = "clustering"
organism = "human"
# path(s) to custom marker lists, or None to assemble lists with MarkerRepo
marker_lists = None

# path to the annotate_by_marker_and_features repository
repo_path = "../annotate_by_marker_and_features"
# None uses all lists in the lists folder of the repository;
# otherwise set the path to a folder that contains the lists you want
lists_path = None

--------------

## 5 - Assemble marker lists

In [None]:
if not marker_lists and not organism:
    raise ValueError("Please provide either <organism> or a path to custom marker list <marker_lists>")
if not marker_lists:
    # Search the combined marker lists by organism name
    combined = mr.combine_dfs(repo_path=repo_path, lists_path=lists_path)
    df = mr.search_df(df=combined, col_to_search="Organism name",
                      search_terms=[f"+{organism.split(' ')[0]}"])
    print(f"* Possible keys for <column_specific_terms>:\n {df.columns.to_list()}\n")
    for col in df.columns[:12]:
        print(f"* Possible values for {col}: {df[col].dropna().drop_duplicates().to_list()}\n")

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
%bgcolor PowderBlue

# Marker list assembly
if not marker_lists:
    # we recommend specifying "Tissue" if possible to get more accurate results
    column_specific_terms = {"Organism name": organism, "Tissue": "esophagus"}

    cml_parameters = [{"file_name":"panglao_two_column", "style":"two_column"}, 
                      {"file_name":"panglao_score", "style":"score"},
                      #{"file_name":"panglao_ui", "style":"ui"}
                     ]

___

In [None]:
if not marker_lists:
    marker_lists = wrap.create_multiple_marker_lists(
        cml_parameters=cml_parameters, 
        repo_path=repo_path, 
        lists_path=lists_path,
        organism=organism, 
        ensembl=mr.check_ensembl(adata), 
        column_specific_terms=column_specific_terms, 
        show_lists=True,
        path=settings.table_dir
    )

--------------

## 6 - Annotate adata

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
%bgcolor PowderBlue

marker_repo = True
SCSA = True
mr_obs = "MR"
scsa_obs = "SCSA"
rank_genes_column = None
reference_obs = None

___

In [None]:
compare_df = wrap.run_annotation(adata, 
                                 marker_repo=marker_repo, 
                                 SCSA=SCSA, 
                                 marker_lists=marker_lists, 
                                 mr_obs=mr_obs, 
                                 scsa_obs=scsa_obs, 
                                 rank_genes_column=rank_genes_column, 
                                 clustering_column=clustering_column, 
                                 reference_obs=reference_obs, 
                                 show_comparison=True, 
                                 ignore_overwrite=True, 
                                 show_plots=False,
                                 output_path=settings.table_dir
                                )

In [None]:
if not rank_genes_column:
    rank_genes_column = f"rank_genes_groups_{clustering_column}"

# Plot dotplot of markers
_ = pl.marker_genes.rank_genes_plot(
    adata,
    key=rank_genes_column,
    n_genes=10,
    style="dots",
    save=f"marker_genes_dots_{clustering_column}.pdf"
)

In [None]:
# Plot cell type annotations
columns = [clustering_column] + list(compare_df.columns)
_ = pl.embedding.plot_embedding(adata, method="umap", color=columns, ncols=2,
                                save="compare_annotations.pdf")

--------------

### 6.1 - Show annotated .obs table

In [None]:
display(adata.obs)
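
To see how clusters map to assigned cell types, the `.obs` table can be cross-tabulated with pandas. The sketch below uses a toy DataFrame in place of `adata.obs`; the column name `MR_annotation` is hypothetical, and the actual annotation column names are those listed in `compare_df.columns`.

```python
import pandas as pd

# Toy stand-in for adata.obs; "MR_annotation" is a hypothetical column name.
obs = pd.DataFrame({
    "clustering": ["0", "0", "1", "1", "1"],
    "MR_annotation": ["Fibroblast", "Fibroblast", "Endocardium", "Endocardium", "Fibroblast"],
})

# Count cells per cluster and assigned cell type.
counts = pd.crosstab(obs["clustering"], obs["MR_annotation"])
print(counts)
```

In the notebook, `obs` would be replaced by `adata.obs` and the column names by `clustering_column` and one of the annotation columns.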

--------------

## 7 - Save adata

In [None]:
utils.adata.save_h5ad(adata, "anndata_annotated.h5ad")