In [None]:
from sctoolbox.utils.jupyter import bgcolor, _compare_version

# change the background of input cells
bgcolor("PowderBlue", select=[2, 4, 5])

nb_name = "99-report.ipynb"
_compare_version(nb_name)

# 01 - Analysis report
<hr style="border:2px solid black"> </hr>

## 1 - Description
This notebook creates a PowerPoint analysis report. It collects key information created during the individual analysis steps, e.g. plots, versions and thresholds, and combines it into a comprehensive report detailing the whole analysis.

______

## 2 - Setup

In [None]:
from pathlib import Path
import sctoolbox.tools.report as sc_report

______

## 3 - Report options and content selection
<hr style="border:2px solid black"> </hr>

### 3.1 Global report settings
The following box provides options concerning the whole report.

|     Name     | Description | Default |
|--------------|-------------|---------|
|`dataset_name`|The name of the dataset. Will be displayed on the title slide.|-|
| `report_dir` |The directory that contains all of the report information.|`../report/`|
|  `template`  |A PowerPoint providing template slides. Used for the slide design.|`scRNA-template.pptx`|
| `slide_size` |Set the slide format. Can be “standard”, “widescreen”, “a4-portait”, “a4-landscape” or a tuple of numbers indicating (height, width) in cm.|`widescreen`|
| `max_pixels` |The maximum number of pixels an image is allowed to have. Images exceeding the size are automatically scaled down.|`5e7`|
|  `file_ext`  | A list of the file extensions that will be used to fill the slides with content.|`["png", "md", "txt"]`|

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
dataset_name = "ext12345"

# advanced options
report_dir = "../report/"
template = "scRNA-template.pptx"
slide_size = "widescreen"
max_pixels = 5e7
file_ext = ["png", "md", "txt"]
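
The `max_pixels` option limits the total number of pixels per image. As an illustration of what such a limit implies, the sketch below computes the largest size that fits the limit while preserving the aspect ratio. The helper `scaled_size` is hypothetical and only assumes aspect-ratio-preserving scaling; the report tooling may scale images differently.

```python
import math


def scaled_size(width, height, max_pixels=5e7):
    """Return (width, height) shrunk to fit within max_pixels, keeping the aspect ratio.

    Hypothetical helper for illustration, not part of sctoolbox.
    """
    n_pixels = width * height
    if n_pixels <= max_pixels:
        return width, height  # small enough, leave untouched

    # scaling both sides by sqrt(max/n) scales the area by max/n
    factor = math.sqrt(max_pixels / n_pixels)
    return max(1, math.floor(width * factor)), max(1, math.floor(height * factor))


print(scaled_size(4000, 3000))    # below the limit, returned unchanged
print(scaled_size(10000, 10000))  # above the limit, scaled down
```

Rounding down keeps the result at or below the limit, at the cost of a slightly smaller image than strictly necessary.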

### 3.2 Sections
Select the sections that should be displayed in the final PowerPoint report.

**No changes required unless the section order or the displayed section names need changes or a new section should be included.**

In [None]:
print("Found the following sections to report:")
for sec in sorted(Path(report_dir).glob("*")):
    if not sec.is_dir():
        continue

    print(f"    - {sec.name}")

**Select** and **order** the sections of the final report PowerPoint by editing the dictionary below.  

`"<section_id>": "<title displayed in the slide>"`  
The *section_id* has to be in the list above. Sections that are specified below but not available are ignored.

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
# a dict to define section order and the displayed titles
section_titles = {
    "01_assembly": "Dataset Assembly",
    "02_QC": "Quality Control",
    "03_batch_correction": "Normalization and Noise Reduction",
    "04_clustering": "Embedding and Clustering",
    "group_markers": "Group Marker prediction",
    "annotation": "Cell Type Annotation",
    "GSEA": "Gene Set Enrichment Analysis",
    "0A1_receptor_ligand": "Receptor-Ligand Analysis",
    "0A2_receptor_ligand_differences": "Receptor-Ligand Difference Analysis",
    "0B_velocity": "RNA Velocity Analysis",
    "proportion_analysis": "Proportion Analysis",
    "pseudotime_analysis": "Pseudotime Trajectory Analysis"
}
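
Since section ids without a matching directory are ignored, it can help to see in advance which entries of the dictionary will be dropped. The following is a minimal sketch; `split_sections` is a hypothetical helper and assumes that a section counts as available if a directory of the same name exists in the report directory.

```python
import tempfile
from pathlib import Path


def split_sections(report_dir, section_titles):
    """Split section_titles into entries with and without a matching directory.

    Hypothetical helper for illustration, not part of sctoolbox.
    """
    available = {p.name for p in Path(report_dir).glob("*") if p.is_dir()}

    used = {sid: title for sid, title in section_titles.items() if sid in available}
    ignored = [sid for sid in section_titles if sid not in available]
    return used, ignored


# demo on a throwaway directory that only contains one section
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "01_assembly").mkdir()
    titles = {"01_assembly": "Dataset Assembly", "GSEA": "Gene Set Enrichment Analysis"}

    used, ignored = split_sections(tmp, titles)
    print("used:", list(used))
    print("ignored:", ignored)
```

In the notebook the same call would be `split_sections(report_dir, section_titles)`. The order of the dictionary is preserved, so `used` also reflects the order of the sections.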

### 3.3 Slide options
<hr style="border:1px solid black"> </hr>
The dictionary below holds all the information needed to format and populate the slides of the report.

**No changes needed unless slide order or slide content needs to be changed or new slides are required.**

#### Structure
The top-level keys are the *section ids*. The corresponding value can either be a dictionary or a list of dictionaries. The bottom-level dictionaries provide the slide options. Available slide options can be found [here (PowerPointReport.add_slide)](https://loosolab.pages.gwdg.de/software/pptreport/API/index.html#pptreport.powerpointreport.PowerPointReport.add_slide).
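
To make the two allowed shapes concrete, the sketch below walks such a dictionary and treats a single dictionary like a list with one entry. The helper `iter_slides` is hypothetical and only illustrates the structure; it is not how the report is generated internally.

```python
def iter_slides(slide_section_args):
    """Yield (section_id, slide_options) pairs, one per slide dictionary."""
    for section_id, value in slide_section_args.items():
        # a single dict is treated like a list with one entry
        slides = [value] if isinstance(value, dict) else list(value)
        for options in slides:
            yield section_id, options


example = {
    "01_assembly": {"title": "Dataset Assembly"},
    "02_QC": [{"title": "Gender prediction"}, {"title": "Cell filter I"}],
}

for section_id, options in iter_slides(example):
    print(section_id, "->", options["title"])
```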

<h1><center>⬐ Fill in input data here ⬎</center></h1>

In [None]:
slide_section_args = {
    "01_assembly": {
        "title": "Dataset Assembly",
        "content": [
            "../report/01_assembly/01_assembly.png",
            "../report/01_assembly/01_dataset_overview.md",
            "../report/01_assembly/01_obs.png",
            "../report/01_assembly/01_var.png"
        ],
        "notes": [
            "A dataset may be constructed from multiple parts (samples, sequencing batches, etc.). The first table shows the individual parts and their location from which the dataset is constructed.",
            "\n",
            "The text provides a short overview of the dataset dimensions and available information. In a typical scRNA experiment each observation translates to a cell and each variable to a gene. This means that observation information contains information about each cell, e.g. the timepoint a cell belongs to. Likewise, variable information provides details about individual genes, for example how many cells had expression of a certain gene.",
            "Datasets can contain additional layers, i.e. data matrices, such as raw expression or normalized expression values.",
            "The last two tables show a detailed view of the observation (obs) and variable (var) information of the dataset.",
        ],
        "content_alignment": "left",
        "content_layout": "vertical"
    },
    "02_QC": [
        {
            "title": "Gender prediction",
            "notes": "The gender of each sample is predicted based on the expression of a female-specific gene. Female samples are shown with a red background.",
            "content": "../report/02_QC/01*"
        },
        {
            "title": "Cell filter I",
            "content": "../report/02_QC/02*",
            "notes": [
                "Low quality cells are removed from the dataset by applying thresholds to quality metrics. This is either done using global thresholds or thresholds for each individual sample.",
                "\n",
                "Common quality metrics:",
                "n_genes:			The number of genes associated with a barcode (cell).",
                "total_counts:			The total amount of reads detected for this barcode (cell).",
                "log1p_total_counts:		Same as above but on a logarithmic scale.",
                "total_counts_is_mito:		The total amount of reads associated with mitochondrial genes.",
                "log1p_total_counts_is_mito:		Same as above but on a logarithmic scale.",
                "pct_counts_is_mito:		Percentage of mitochondrial reads per barcode (cell).",
                "total_counts_is_ribo:		The total amount of reads associated with ribosomal genes.",
                "log1p_total_counts_is_ribo:		Same as above but on a logarithmic scale.",
                "pct_counts_is_ribo:		Percentage of ribosomal reads per barcode (cell).",
                "total_counts_is_gender:		The total amount of reads associated with gender genes.",
                "log1p_total_counts_is_gender:	Same as above but on a logarithmic scale.",
                "pct_counts_is_gender:		Percentage of gender related reads per barcode (cell)."
            ]
        },
        {
            "title": "Cell filter II",
            "content": "../report/02_QC/03*",
            "notes": "This plot provides a pairwise view of the quality metrics, their distributions, and the respective thresholds."
        },
        {
            "title": "Cell filter III",
            "content": [
                "../report/02_QC/04_cell_filter_impact.png",
                "../report/02_QC/04_doublet_info.txt",
                "../report/02_QC/04_cell_filter_info.txt"
            ],
            "notes": [
                "The UpSet plot shows the impact each threshold has on the dataset. This is shown for each metric alone and in combination with other metrics to estimate the overlap between the filters.",
                "The final number of filtered cells is given in the text.",
                "The dataset may also be filtered for doublets, observations that contain more than one cell, to avoid skewing the following analysis."
            ],
            "content_layout": "vertical",
            "height_ratios": [0.8, 0.1, 0.1]
        },
        {
            "title": "Gene filter I",
            "content": "../report/02_QC/05*",
            "notes": [
                "Similar to cell filtering, genes may also be removed from the dataset with the aid of quality metrics.",
                "\n",
                "Common quality metrics:",
                "n_cells_by_counts:	The number of cells that contain reads associated with the gene.",
                "mean_counts:		The mean amount of reads over all cells.",
                "log1p_mean_counts:	Same as above but on a logarithmic scale.",
                "pct_dropout_by_counts:	Percentage of cells this gene does not appear in.",
                "total_counts:		The total amount of reads associated to this gene.",
                "log1p_total_counts:	Same as above but on a logarithmic scale."
            ]
        },
        {
            "title": "Gene filter II",
            "notes": "Genes are categorized into mitochondrial, ribosomal or gender related. This is either based on a list of genes or a search string e.g. genes starting with 'mt' for mitochondrial genes. Genes related to mitochondrial, ribosomal or gender functions are optionally filtered from the dataset.",
            "content": [
                "../report/02_QC/06_gene_labels.png",
                "../report/02_QC/06_*info.txt"
            ],
            "content_layout": "vertical",
            "height_ratios": [0.7, 0.1, 0.1, 0.1]
        },
        {
            "title": "Dataset denoising",
            "content": "../report/02_QC/07*",
            "notes": "We remove ambient RNA and technical noise from the count matrix using scAR. The tool estimates the ambient profile by averaging cell-free droplets. An autoencoder neural network later corrects the count matrix."
        }
    ],
    "03_batch_correction": [
        {
            "title": "Cell Cycle",
            "notes": "The cell cycle phase of each cell is predicted based on the activity of known cell cycle genes. The plot shows the distribution of cell cycle phases across the samples.",
            "content": "../report/03_batch_correction/01*"
        },
        {
            "title": "PCA",
            "content": "../report/03_batch_correction/02*",
            "notes": [
                "A principal component analysis (PCA) is computed to reduce the dimensionality by reorganizing the data into principal components (PCs). The plot shows the first two PCs with each dot (cell) colored for quality metrics.",
                "There should be no gradient or grouping visible to provide a clean biological signal."
            ]
            
        },
        {
            "title": "PCA - component subset",
            "notes": [
                "Each principal component (PC) is evaluated based on its correlation to quality metrics and variance. PCs exceeding one of the two thresholds are filtered from the dataset, indicated as a light grey bar.",
                "Removing PCs helps to reduce technical or other unwanted noise to create a clean biological signal."
            ],
            "content": "../report/03_batch_correction/03*"
        },
        {
            "title": "Batch correction",
            "content": "../report/03_batch_correction/04*",
            "width_ratios": [0.4, 0.6],
            "notes": [
                "Batch correction aims to reduce technical differences in the dataset. For example, two separate sequencing runs may show as two distinct groups despite being biologically similar. There are multiple batch correction methods available with different concepts causing them to perform differently for each dataset. The plot compares multiple batch correction methods (columns) to select the optimal one for the dataset.",
                "The plot shows PCAs and preliminary UMAPs to assess how well the batches are mixed after batch correction. The bottom plots show the LISI score, a measurement for the 'mixedness' of the batches (maximum = number of batches). A high LISI alone is not a sufficient indicator; the PCA and UMAP shapes should be taken into consideration as well."
            ]
        }
        
    ],
    "04_clustering": [
        {
            "title": "Embedding I",
            "content": "../report/04_clustering/01*",
            "width_ratios": [0.7, 0.3],
            "notes": [
                "The final dimension reduction step is to compute a 2D embedding of the dataset. This can either be a Uniform Manifold Approximation and Projection (UMAP) or a t-distributed stochastic neighbor embedding (t-SNE).",
                "Both methods rely on two main parameters to compute the respective embedding, which are shown in the table. The plot shows the embedding colored for various quality measurements. Ideally, there shouldn't be a gradient or grouping visible, unless expected through the experimental design.",
                "Note: Dot = Cell"
            ]
        },
        {
            "title": "Embedding II",
            "content": "../report/04_clustering/02*",
            "notes": "This plot shows how the cells for each sample/ condition are distributed across the dataset."
        },
        {
            "title": "Clustering I",
            "content": "../report/04_clustering/03*",
            "notes": "The embedding colored for sample/ condition (left) and cells assigned to clusters (right). Clustering is typically done using the Leiden algorithm."
        },
        {
            "title": "Clustering II",
            "content": "../report/04_clustering/04*",
            "notes": [
                "Here, the number of cells per cluster is shown. The color shows the proportion of cells assigned to samples/ conditions.",
                "Small clusters or clusters predominantly from one sample/ condition may be outliers or part of a bigger cluster, as such, they may be removed or combined with another cluster. However, this is tied to the expectations of the individual dataset and must be decided on a case-by-case basis."
            ]
        }
    ],
    "0A1_receptor_ligand": [
        {
            "title": "Database",
            "content": "../report/0A1_receptor_ligand/01*",
            "content_layout": "vertical",
            "height_ratios": [0.1, 0.9],
            "notes": "Receptor-ligand analysis requires a database providing the receptor-ligand gene pairs. The table shows an excerpt from the database selected for this analysis."
        },
        {
            "title": "Interaction overview",
            "content": "../report/0A1_receptor_ligand/02*",
            "notes": [
                "The Cyclone plot shows the amount of interactions between groups, e.g. cell types. An interaction is counted if the receptor- and ligand-gene expression are enriched for the respective groups. The inner-grey shell provides the number of cells per group, while the outermost shell shows the top receptor and ligand genes for the respective group."
            ]
        },
        {
            "title": "Interaction detail",
            "content": "../report/0A1_receptor_ligand/03*",
            "notes": [
                "The Connection plot provides a qualitative view of individual interactions. It shows the percentage of cells expressing the gene (dot size), the gene enrichment for the respective group (dot color) and interaction strength (line width).",
                "The higher each value the better."
            ]
        }
    ],
    "0A2_receptor_ligand_differences": [
        {
            "title": "Pairwise Differences",
            "notes": [
                "The network shows the top interaction differences, using quantile rank differences, between two conditions. Blue lines indicate increased interaction in one condition, while red lines indicate increased interaction in the other (the closer to 1 or -1 the bigger the difference).",
                "The node color and shape define the cell type(s) involved in the interaction. Hubs are centered around a receptor or ligand gene that interacts with multiple partners/ cell types."
            ]
        },
        {
            "title": "Temporal Progression",
            "notes": [
                "These plots focus on the differences between consecutive conditions, e.g. day 1 <-> day 2, day 2 <-> day 3. They are the same as the pairwise difference plots, apart from their focus on specific combinations.",
                "The network shows the top interaction differences, using quantile rank differences, between two conditions. Blue lines indicate increased interaction in one condition, while red lines indicate increased interaction in the other (the closer to 1 or -1 the bigger the difference).",
                "The node color and shape define the cell type(s) involved in the interaction. Hubs are centered around a receptor or ligand gene that interacts with multiple partners/ cell types."
            ]
        },
        {
            "title": "Interaction over Time",
            "notes": "This plot shows how the expression of receptor and ligand genes changes over time in specific cell types."
        }
    ],
    "0B_velocity": [
        {
            "title": "Splicing Rate",
            "notes": [
                "The plot shows the proportions of spliced and unspliced RNA.",
                "Depending on the sequencing method, typically 10-25% of the RNA is unspliced.",
                "https://scvelo.readthedocs.io/en/stable/VelocityBasics.html#Load-the-Data"
            ]
        },
        {
            "title": "Velocity Confidence",
            "content_layout": "vertical",
            "notes": [
                "The velocity length describes the differentiation speed of a cell, while the confidence describes the coherence of the velocity vectors with neighboring cells, i.e. if neighboring cells differentiate in the same direction.",
                "https://scvelo.readthedocs.io/en/stable/VelocityBasics.html#Speed-and-coherence"
            ]
        },
        {
            "title": "RNA Velocity",
            "notes": "The arrows on top of the embedding provide a view of the differentiation/ developmental paths the different cell populations may follow over time."
        }
    ],
    "group_markers": [
        {
            "title": "Top Marker Genes",
            "content_layout": "vertical",
            "notes": [
                "The top marker genes for each group, e.g. a clustering or cell types, presented in heatmap and dotplot style.",
                "Good quality marker genes are exclusively expressed in one group, meaning they show up in only one row of the plots.",
                "A dataset with groups that are well separated by gene markers shows a 'stair-like' pattern in the plots.",
                "Groups with shared markers might be combined to provide better separated groups."
            ]
        },
        {
            "title": "Top Marker Expression per Group I",
            "notes": [
                "The top marker for each group (row). The first embedding shows the location of the group and the following plots show the respective gene expression binned into hexagonal tiles."
            ]
        },
        {
            "title": "Top Marker Expression per Group II",
            "notes": [
                "The top marker for each group (row). The first embedding shows the location of the group and the following plots show which cells express the gene."
            ]
        },
        {
            "title": "Genes of Interest I",
            "notes": "The mean expression of manually selected genes and a separate embedding showing the expression of each gene summarized into hexagonal tiles."
        },
        {
            "title": "Genes of Interest II",
            "notes": [
                "The expression of the manually selected genes next to each group (row).",
                "The first embedding shows the location of the group and the following plots show the respective gene expression binned into hexagonal tiles."
            ]
        },
        {
            "title": "Genes of Interest III",
            "notes": [
                "The expression of the manually selected genes next to each group (row).",
                "The first embedding shows the location of the group and the following plots show which cells express the gene."
            ]
        },
        {
            "title": "Condition Markers",
            "notes": [
                "This slide shows marker genes that separate the condition within a group. A dotplot is shown for each group where condition markers are found."
            ]
        },
    ],
    "annotation": [
        {
            "title": "Cell Type Annotation",
            "content": [
                "../report/annotation/01_marker_genes_*",
                "../report/annotation/01_Comparison_of_cell_type_annotations.png"
            ],
            "content_layout": "vertical",
            "notes": [
                "The dotplot shows the groups to which cell types are assigned and their respective top marker genes.",
                "The table shows different annotation versions (columns) based on different algorithms. The annotation is either based on SCSA or on the SC-Framework internal marker repo (MR)."
            ]
        },
        {
            "title": "Cell Type Annotation",
            "notes": [
                "The first embedding is colored for the initial groups, while the following embeddings show the different annotation versions from the previous slides."
            ]
        }
    ],
    "GSEA": [
        {
            "title": "Top Pathways per Group",
            "notes": [
                "The dotplot shows the top enriched pathways per cluster. The size of the dot indicates the fraction of genes in the cluster that match the pathway and the color of the dot indicates statistical significance (higher is better)."
            ]
        },
        {
            "title": "Pathway related Genes",
            "notes": "The term dotplot focuses on a single term/pathway and thus shows individual genes instead of pathways on the y-axis. A Z-Score is applied to the mean gene expression per cluster to highlight differences in expression between the clusters (x-axis)."
        },
        {
            "title": "Pathway Gene-sharing Network",
            "notes": "The network plot shows connections between enriched pathways per cluster. The node size corresponds to the percentage of gene overlap in a certain term of interest. The colour of the node corresponds to the significance of the enriched terms and the edge size corresponds to the number of genes that overlap between two connected nodes."
        }
    ],
    "proportion_analysis": [
        {
            "title": "Proportion Analysis",
            "notes": [
                "The table shows the proportions for each group of the dataset (rows) split by the conditions.",
                "\n",
                "Column descriptions:",
                "baseline_props:	The proportion of cells belonging to the respective group.",
                "mean_props_*:		The proportion of cells belonging to the respective group and condition.",
                "t_/f_statistics:		t-test or ANOVA test statistic. A higher value means a bigger difference between conditions.",
                "p_values:		Significance of proportional changes between conditions.",
                "adjusted_p_values:	Significance corrected for the number of groups."
            ]
        },
        {
            "title": "Proportion Analysis",
            "notes": [
                "The plots show the proportion (amount) of cells of each group (e.g. cell type) allocated to each of the conditions. The p-value on top of each plot describes whether there is a significant change in proportion between any of the conditions. In case there are no replicates Scanpro will create simulated replicates (similar to random subsamples) to improve statistical robustness. The replicates are either shown as separate entities (upper plot) or as a box-distribution (lower plot). E.g. for a dataset where clustering_col = 'celltype' and condition_col = 'injury' a plot with low p-value can be interpreted as 'Cell Type X shows a high change in the number of cells between injured and healthy' and a high p-value can be interpreted as 'Cell Type Y shows a low change in the number of cells between injured and healthy'."
            ]
        },
        {
            "title": "Proportion Analysis",
            "notes": [
                "The plots show the proportion (amount) of cells of each group (e.g. cell type) allocated to each of the conditions. The p-value on top of each plot describes whether there is a significant change in proportion between any of the conditions. In case there are no replicates Scanpro will create simulated replicates (similar to random subsamples) to improve statistical robustness. The replicates are either shown as separate entities (upper plot) or as a box-distribution (lower plot). E.g. for a dataset where clustering_col = 'celltype' and condition_col = 'injury' a plot with low p-value can be interpreted as 'Cell Type X shows a high change in the number of cells between injured and healthy' and a high p-value can be interpreted as 'Cell Type Y shows a low change in the number of cells between injured and healthy'."
            ]
        }
    ],
    "pseudotime_analysis": [
        {
            "title": "Trajectory and Root Selection",
            "notes": [
                "A principal graph is inferred from the embedding. Cells are assigned to a node by a value between 0 and 1, which allows the use of probabilistic mappings to account for variability. In case of the clustering showing cell types this can be interpreted as potential differentiation paths a cell could take. However, they are not yet in a pseudotime context so there is no direction in the presented graph.",
                "The pseudotime start point is either defined as one of the nodes or the expression strength and density of a gene."
            ]
        },
        {
            "title": "Pseudotime Trajectory I",
            "notes": "The same graph as before but with the directions defined through the previously selected root."
        },
        {
            "title": "Pseudotime Trajectory II",
            "notes": [
                "1. Segments are edges (or branches) in the trajectory tree or graph. Each segment represents a developmental transition between two milestones (nodes), corresponding to cells that are transitioning between states.",
                "2. The milestones are the nodes of the trajectory tree. Each milestone represents a key point or “state” during the differentiation process.",
                "3. The pseudotime values define potential cell trajectories with the root as the starting point. The grey area is the starting location. The time progresses from light to dark red.",
                "4. The grouping is shown as a reference to the other plots to identify e.g. potential biological differentiation paths."
            ]
        },
        {
            "title": "Pseudotime Trajectory III",
            "notes": [
                "The dendrogram shows the branching of the segments as depicted in the previous slide.",
                "The dendrogram is colored by the selected grouping to e.g. show the emergence of cell types over pseudotime."
            ]
        }
    ]
}
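
Several `content` entries above use wildcards, so the number of files on a slide depends on what is present in the report directory. The sketch below previews which files a slide would pick up and flags a vertical layout whose `height_ratios` do not match the number of files. The helper `preview_content` is hypothetical. It assumes standard glob semantics for the patterns and one row per file in a vertical layout, both of which may differ from what the report tooling actually does.

```python
import glob
import tempfile
from pathlib import Path


def preview_content(slide):
    """Expand the 'content' patterns of one slide and flag ratio mismatches.

    Hypothetical helper for illustration, not part of sctoolbox or pptreport.
    """
    content = slide.get("content", [])
    patterns = [content] if isinstance(content, str) else list(content)

    # expand every pattern; a plain path matches itself if the file exists
    files = []
    for pattern in patterns:
        files.extend(sorted(glob.glob(pattern)))

    problems = []
    ratios = slide.get("height_ratios")
    # assumption: in a vertical layout each file occupies one row
    if slide.get("content_layout") == "vertical" and ratios is not None:
        if len(ratios) != len(files):
            problems.append(
                f"height_ratios has {len(ratios)} entries but {len(files)} files matched"
            )
    return files, problems


# demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("06_gene_labels.png", "06_mito_info.txt"):
        (Path(tmp) / name).touch()

    slide = {
        "content": [f"{tmp}/06_gene_labels.png", f"{tmp}/06_*info.txt"],
        "content_layout": "vertical",
        "height_ratios": [0.7, 0.1, 0.1, 0.1],
    }
    files, problems = preview_content(slide)
    print(len(files), "files matched")
    print(problems)
```

In the notebook, the same check could be run on each slide dictionary in `slide_section_args` before generating the report.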

______

## 4 - Generate the report
This part combines all the information and settings from above into the final PowerPoint structure.

In [None]:
report = sc_report.generate_report(
    dataset_name=dataset_name,
    section_titles=section_titles,
    slide_sec_kwargs=slide_section_args,
    file_ext=file_ext,
    template=template,
    slide_format=slide_size,
    report_dir=report_dir,
    max_pixels=max_pixels,
    method_template="RNA"
)

______

## 5 - Save Report
Render the PowerPoint report and save it to the disk.

In [None]:
# Save presentation
out_path = Path(report_dir) / "report.pptx"

report.save(str(out_path))

display(f"Saved to: {out_path}")