```
conda install -c conda-forge scanpy python-igraph leidenalg
```

In [None]:
## Scanpy first analysis
# Load libraries and settings
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import scvelo as scv
sc.settings.verbosity = 3
sc.logging.print_header()
sc.settings.set_figure_params(dpi=100, facecolor='white')

In [None]:
# Define the results file
results_file = "time_course.h5ad"

In [None]:
# Load the filtered feature-barcode matrices (.h5) produced by Cell Ranger
samples = {
    "day0": "../cellranger/data/10x/1_p2_day0_RNA/outs/filtered_feature_bc_matrix.h5",
    "day1": "../cellranger/data/10x/2_p2_day1_RNA/outs/filtered_feature_bc_matrix.h5",
    "day2": "../cellranger/data/10x/3_p2_day2_RNA/outs/filtered_feature_bc_matrix.h5",
    "day3": "../cellranger/data/10x/4_p2_day3_RNA/outs/filtered_feature_bc_matrix.h5",
}
adatas = {}

for sample_id, filename in samples.items():
    #path = EXAMPLE_DATA.fetch(filename)
    sample_adata = sc.read_10x_h5(filename)
    sample_adata.var_names_make_unique()
    adatas[sample_id] = sample_adata

adata = ad.concat(adatas, label="sample")
adata.obs_names_make_unique()
print(adata.obs["sample"].value_counts())
adata


The data contains ~15,000 cells per sample and 36k measured genes. We’ll now investigate these with a basic preprocessing and clustering workflow.




# Quality Control

The scanpy function calculate_qc_metrics() calculates common quality control (QC) metrics, which are largely based on calculateQCMetrics from scater [McCarthy et al., 2017]. One can pass specific gene populations to calculate_qc_metrics() in order to calculate the proportion of counts for each of these populations. Mitochondrial, ribosomal and hemoglobin genes are defined by distinct prefixes as listed below.
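Before applying these prefixes to `adata.var_names`, it can help to check them on a few symbols. The sketch below uses only the standard library; the gene symbols are illustrative examples and are not taken from this dataset.

```python
import re

# Same pattern as used for the "hb" flag; pandas' str.contains applies it
# with regular-expression search semantics.
HB_PATTERN = re.compile(r"^HB[^(P)]")

# Illustrative human gene symbols (assumption: not read from the dataset).
example_genes = ["HBB", "HBA1", "HBP1", "HBEGF", "MT-CO1", "RPS6", "RPL3"]

flags = {
    gene: {
        "mt": gene.startswith("MT-"),
        "ribo": gene.startswith(("RPS", "RPL")),
        "hb": bool(HB_PATTERN.search(gene)),
    }
    for gene in example_genes
}

for gene, gene_flags in flags.items():
    print(gene, gene_flags)
```

Note that the hemoglobin pattern excludes symbols starting with `HBP` but still matches `HBEGF`, which is a growth factor gene, so the `hb` fraction should be read as an approximation.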

In [None]:
# mitochondrial genes, "MT-" for human, "Mt-" for mouse
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# ribosomal genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# hemoglobin genes
adata.var["hb"] = adata.var_names.str.contains("^HB[^(P)]")

In [None]:
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo", "hb"], inplace=True, log1p=True
)

One can now inspect violin plots of some of the computed QC metrics:

- the number of genes expressed in the count matrix
- the total counts per cell
- the percentage of counts in mitochondrial genes

In [None]:
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)

In [None]:
sc.pl.scatter(adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt")

Based on the QC metric plots, one could now remove cells that have too many mitochondrial genes expressed or too many total counts by setting manual or automatic thresholds. However, sometimes what appear to be poor QC metrics can be driven by real biology, so we suggest starting with a very permissive filtering strategy and revisiting it at a later point. We therefore now only filter out cells with fewer than 500 genes expressed and genes that are detected in fewer than 3 cells.

Additionally, it is important to note that for datasets with multiple batches, quality control should be performed for each sample individually, as quality control thresholds can vary substantially between batches.
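One possible way to set automatic, per-sample thresholds is to flag cells that lie several median absolute deviations (MADs) away from the median of their own sample. The sketch below is not part of the original workflow: the helper `mad_outliers`, the cutoff of 5 MADs and the toy table standing in for `adata.obs` are all assumptions made for illustration.

```python
import pandas as pd

def mad_outliers(values: pd.Series, nmads: float = 5.0) -> pd.Series:
    """Flag values further than `nmads` median absolute deviations from the median."""
    median = values.median()
    deviation = (values - median).abs()
    return deviation > nmads * deviation.median()

# Toy stand-in for adata.obs (hypothetical values).
obs = pd.DataFrame({
    "sample": ["day0"] * 6 + ["day1"] * 6,
    "log1p_total_counts": [8.0, 8.1, 7.9, 8.2, 8.0, 3.0,
                           9.0, 9.1, 8.9, 9.2, 9.0, 9.1],
})

# Evaluate the rule separately within each sample.
outlier = pd.Series(False, index=obs.index)
for sample_id, idx in obs.groupby("sample").groups.items():
    outlier.loc[idx] = mad_outliers(obs.loc[idx, "log1p_total_counts"])
obs["outlier"] = outlier

print(obs)
```

With the real object, the same loop could be run on `adata.obs` for metrics such as `log1p_total_counts` or `pct_counts_mt`, and the resulting boolean column used to subset the cells.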

In [None]:
sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=3)

# Doublet detection

As a next step, we run a doublet detection algorithm. Identifying doublets is crucial as they can lead to misclassifications or distortions in downstream analysis steps. Scanpy contains the doublet detection method Scrublet [Wolock et al., 2019]. Scrublet predicts cell doublets using a nearest-neighbor classifier of observed transcriptomes and simulated doublets. scanpy.pp.scrublet() adds doublet_score and predicted_doublet to .obs. One can now either filter directly on predicted_doublet or use the doublet_score later during clustering to filter clusters with high doublet scores.
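The second option, filtering whole clusters by their doublet score, can be sketched with pandas once cluster labels exist. The helper name `high_doublet_clusters`, the threshold of 0.25 and the toy table standing in for `adata.obs` are assumptions for illustration only.

```python
import pandas as pd

def high_doublet_clusters(obs, cluster_key, score_key="doublet_score", threshold=0.25):
    """Return the clusters whose mean doublet score exceeds `threshold`."""
    mean_scores = obs.groupby(cluster_key)[score_key].mean()
    return mean_scores[mean_scores > threshold].index.tolist()

# Toy stand-in for adata.obs after clustering (hypothetical values).
obs = pd.DataFrame({
    "leiden": ["0", "0", "1", "1", "2", "2"],
    "doublet_score": [0.02, 0.05, 0.40, 0.55, 0.10, 0.08],
})

flagged = high_doublet_clusters(obs, "leiden")
print(flagged)

# Cells outside the flagged clusters would be kept.
keep = ~obs["leiden"].isin(flagged)
```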

In [None]:
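# Run Scrublet per sample on the raw counts.
# scanpy.pp.scrublet is the function described above; the scanpy.external
# wrapper is the older entry point to the same method.
sc.pp.scrublet(adata, batch_key="sample")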


# Normalization

The next preprocessing step is normalization. A common approach is count depth scaling with subsequent log plus one (log1p) transformation. Count depth scaling normalizes the data to a “size factor” such as the median count depth in the dataset, ten thousand (CP10k) or one million (CPM, counts per million). The size factor for count depth scaling can be controlled via target_sum in pp.normalize_total. We are applying median count depth normalization with log1p transformation (AKA log1PF).
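To make the arithmetic concrete, here is a small NumPy sketch of median count depth scaling followed by log1p on a toy matrix. It mirrors what `normalize_total` does when `target_sum` is left unset (each cell is scaled to the median total count); the matrix values are made up.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 3 genes (columns), hypothetical values.
counts = np.array([
    [10, 0, 30],
    [5, 5, 10],
    [0, 2, 2],
], dtype=float)

depth = counts.sum(axis=1)        # total counts per cell
target = np.median(depth)         # size factor: median count depth

# Scale every cell so that its counts sum to the target, then log-transform.
scaled = counts / depth[:, None] * target
log_norm = np.log1p(scaled)

print(depth, target)
print(scaled.sum(axis=1))
```

After scaling, all cells have the same total, so remaining differences between cells reflect relative rather than absolute expression.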

In [None]:
# Saving count data
adata.layers["counts"] = adata.X.copy()

In [None]:
# Normalizing to median total counts
sc.pp.normalize_total(adata)
# Logarithmize the data
sc.pp.log1p(adata)

# Feature selection

As a next step, we want to reduce the dimensionality of the dataset and only include the most informative genes. This step is commonly known as feature selection. The scanpy function pp.highly_variable_genes annotates highly variable genes by reproducing the implementations of Seurat [Satija et al., 2015], Cell Ranger [Zheng et al., 2017], and Seurat v3 [Stuart et al., 2019] depending on the chosen flavor.

In [None]:
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="sample")

In [None]:
sc.pl.highly_variable_genes(adata)

# Dimensionality Reduction

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.



In [None]:
sc.tl.pca(adata)

Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function leiden() or tsne(). In our experience, there does not seem to be a significant downside to overestimating the number of principal components.
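If a numeric rule of thumb is wanted alongside the plot, one option is to take the smallest number of leading PCs that covers a chosen share of the variance captured by the computed PCs. This is only a heuristic sketch: the helper `n_pcs_for_variance` and the 0.9 cutoff are assumptions, and the toy ratios stand in for `adata.uns["pca"]["variance_ratio"]`.

```python
import numpy as np

def n_pcs_for_variance(variance_ratio, min_fraction=0.9):
    """Smallest number of leading PCs whose share of the variance captured
    by all computed PCs reaches `min_fraction`."""
    ratio = np.asarray(variance_ratio, dtype=float)
    cumulative = np.cumsum(ratio) / ratio.sum()
    n = int(np.searchsorted(cumulative, min_fraction)) + 1
    return min(n, ratio.size)

# Toy variance ratios (hypothetical values).
toy_ratio = [0.4, 0.2, 0.1, 0.05, 0.05]
print(n_pcs_for_variance(toy_ratio, 0.9))
```

Since overestimating the number of PCs appears to be harmless, the result is better treated as a lower bound than as an exact setting.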

In [None]:
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)

You can also plot the principal components to see if there are any potentially undesired features (e.g. batch, QC metrics) driving significant variation in this dataset. In this case, there isn’t anything too alarming, but it’s a good idea to explore this.



In [None]:
sc.pl.pca(
    adata,
    color=["sample", "sample", "pct_counts_mt", "pct_counts_mt"],
    dimensions=[(0, 1), (2, 3), (0, 1), (2, 3)],
    ncols=2,
    size=2,
)

# Nearest neighbor graph construction and visualization

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix.

In [None]:
sc.pp.neighbors(adata)

This graph can then be embedded in two dimensions for visualization with UMAP (McInnes et al., 2018):

In [None]:
sc.tl.umap(adata)

We can now visualize the UMAP according to the sample.

In [None]:
sc.pl.umap(
    adata,
    color="sample",
    # Setting a smaller point size to prevent overlap
    size=2,
    show=True,
)

In [None]:
#%matplotlib inline

We observe a major batch effect between day0 and the other three days. However, this could also reflect real biological changes that took place quickly between day0 and day1. We will continue with clustering and annotation of the data for now; if the separation turns out to be technical, we can integrate the samples with a batch-correction method such as Harmony.

# Clustering

As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) [Traag et al., 2019]. Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

In [None]:
# Using the igraph implementation and a fixed number of iterations can be significantly faster, especially for larger datasets
sc.tl.leiden(adata, flavor="igraph", n_iterations=2)

In [None]:
sc.pl.umap(adata, color=["leiden"])

# Re-assess quality control and cell filtering

As indicated before, we will now re-assess our filtering strategy by visualizing different QC metrics using UMAP.

In [None]:
sc.pl.umap(
    adata,
    color=["leiden", "predicted_doublet", "doublet_score"],
    # increase horizontal space between panels
    wspace=0.5,
    size=3,
)

In [None]:
sc.pl.umap(
    adata,
    color=["leiden", "log1p_total_counts", "pct_counts_mt", "log1p_n_genes_by_counts"],
    wspace=0.5,
    ncols=2,
)

It is quite clear that cells with a mitochondrial fraction above a certain threshold should have been removed. Many cells in cluster 8, mainly from day3, show poor QC, which is also reflected in their number of genes and total counts. Therefore I apply new filters.

In [None]:
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()
print(f"#cells after MT filter: {adata.n_obs}")

In [None]:
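# The raw counts are already stored in adata.layers["counts"] and are carried
# over by the subsetting above. adata.X is now normalized and log-transformed,
# so it must not be copied into that layer again.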
#sc.pp.normalize_total(adata, inplace=False)
#sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="sample")
sc.pl.highly_variable_genes(adata)
sc.tl.pca(adata)
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
sc.pl.pca(
    adata,
    color=["sample", "sample", "pct_counts_mt", "pct_counts_mt"],
    dimensions=[(0, 1), (2, 3), (0, 1), (2, 3)],
    ncols=2,
    size=2,
)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

In [None]:
sc.pl.umap(
    adata,
    color="sample",
    # Setting a smaller point size to prevent overlap
    size=2,
)

In [None]:
sc.pl.umap(
    adata,
    color=["leiden", "predicted_doublet", "doublet_score"],
    # increase horizontal space between panels
    wspace=0.5,
    size=3,
)

In [None]:
sc.pl.umap(
    adata,
    color=["leiden", "log1p_total_counts", "pct_counts_mt", "log1p_n_genes_by_counts"],
    wspace=0.5,
    ncols=2,
)

# Manual cell-type annotation

## Note
This section of the tutorial is expanded upon, using prior knowledge resources such as automated assignment and gene enrichment, in the scverse tutorials.

Cell type annotation is a laborious and repetitive task, one which typically requires multiple rounds of subclustering and re-annotation. It’s difficult to show the entirety of the process in this tutorial, but we aim to show how the tools scanpy provides assist in this process.

We have now reached a point where we have obtained a set of cells with decent quality, and we can proceed to their annotation to known cell types. Typically, this is done using genes that are exclusively expressed by a given cell type, or in other words these genes are the marker genes of the cell types, and are thus used to distinguish the heterogeneous groups of cells in our data. Previous efforts have collected and curated various marker genes into available resources, such as CellMarker, TF-Marker, and PanglaoDB. The cellxgene gene expression tool can also be quite useful to see which cell types a gene has been expressed in across many existing datasets.

Commonly and classically, cell type annotation uses those marker genes subsequent to the grouping of the cells into clusters. So, let’s generate a set of clustering solutions which we can then use to annotate our cell types. Here, we will use the Leiden clustering algorithm which will extract cell communities from our nearest neighbours graph.

In [None]:
for res in [0.02, 0.5, 2.0]:
    sc.tl.leiden(
        adata, key_added=f"leiden_res_{res:4.2f}", resolution=res, flavor="igraph"
    )

Notably, the number of clusters that we define is largely arbitrary, and so is the resolution parameter that we use to control it. As such, the number of clusters is ultimately bound to the stable and biologically meaningful groups that we can distinguish, typically done by experts in the corresponding field or by using expert-curated prior knowledge in the form of markers.
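To compare the clustering solutions numerically as well as visually, the number and size of clusters per resolution can be tabulated. The helper `summarize_clusterings` is a hypothetical name, and the toy table stands in for `adata.obs` with its `leiden_res_*` columns.

```python
import pandas as pd

def summarize_clusterings(obs, prefix="leiden_res_"):
    """Count clusters and report the smallest and largest cluster per clustering column."""
    rows = []
    for col in obs.columns:
        if not col.startswith(prefix):
            continue
        sizes = obs[col].value_counts()
        sizes = sizes[sizes > 0]  # ignore unused categories
        rows.append({
            "key": col,
            "n_clusters": int(sizes.size),
            "smallest": int(sizes.min()),
            "largest": int(sizes.max()),
        })
    return pd.DataFrame(rows).set_index("key")

# Toy stand-in for adata.obs (hypothetical labels).
obs = pd.DataFrame({
    "sample": ["day0", "day0", "day1", "day1", "day2", "day3"],
    "leiden_res_0.02": ["0", "0", "0", "0", "1", "1"],
    "leiden_res_0.50": ["0", "0", "1", "1", "2", "3"],
})

summary = summarize_clusterings(obs)
print(summary)
```

Very small clusters at high resolution are one hint of over-clustering, although the final call still depends on the markers.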



In [None]:
sc.pl.umap(
    adata,
    color=["leiden_res_0.02", "leiden_res_0.50", "leiden_res_2.00"],
    legend_loc="on data",
)

Though UMAPs should not be over-interpreted, here we can already see that in the highest resolution our data is over-clustered, while the lowest resolution is likely grouping cells which belong to distinct cell identities.



# Marker gene set

Let’s define a set of marker genes for the main cell types that we expect to see in this dataset. These were adapted from the Single Cell Best Practices annotation chapter; for a more detailed overview and best practices in cell type annotation, we refer the user to it.

In [None]:
marker_genes = {
    "Pop2": ["CADM2", "ARFGEF3","SHD"],
    "Pop1": ["CRYAB", "RGS6", "PHLDA2"],
    "Pop3": ["PCSK2", "GABRQ", "NSG1"],
    "Pop4": ["SLC38A1", "SLC40A1", "RARRES2"]
}

In [None]:
sc.pl.dotplot(adata, marker_genes, groupby="leiden_res_0.02", standard_scale="var")

In [None]:
sc.pl.dotplot(adata, marker_genes, groupby="leiden_res_0.50", standard_scale="var")

It seems that the clusters from day 0 [0, 1, 2] have higher expression of the three Pop4 markers, as seen by bulk RNA-seq, but it is not very clear. I'd rather obtain the differential markers first; then I'll also try to harmonize the samples to correct the batch effect.
I first calculate markers at the lowest resolution, 'leiden_res_0.02'.

In [None]:
# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden_res_0.02", method="wilcoxon")

We can then visualize the top 25 differentially-expressed genes on a dotplot.

In [None]:
sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_res_0.02", standard_scale="var", n_genes=25
)

The genes GAS5 and MALAT1, together with the ribosomal and mitochondrial genes, suggest a strong batch effect between the day0 sample and the other three; therefore I should run batch-effect correction before continuing the analysis.

# Samples integration with Harmony

In [None]:
# I try harmony
import scanpy.external as sce
import harmonypy
# adata_harm = adata.copy()
sce.pp.harmony_integrate(adata, 'sample')

In [None]:
sc.pp.neighbors(adata, use_rep = 'X_pca_harmony')
sc.tl.umap(adata)

In [None]:
for res in [0.02, 0.5, 2.0]:
    sc.tl.leiden(
        adata, key_added=f"leiden_res_{res:4.2f}", resolution=res, flavor="igraph"
    )

In [None]:
sc.pl.umap(
    adata,
    color=["leiden_res_0.02", "leiden_res_0.50", "leiden_res_2.00"],
    legend_loc="on data",
)

In [None]:
sc.pl.umap(
    adata,
    color=["sample"],
    legend_loc="on data",
)

Now I start again with the Markers

In [None]:
# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden_res_0.50", method="wilcoxon")

In [None]:
sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_res_0.50", standard_scale="var", n_genes=5
)

Cluster 6 is again low quality: it is marked by MALAT1 and also by KCNQ1OT1, a lncRNA associated with low QC. However, there might still be some overclustering, so I reduce the resolution to 0.3.

In [None]:
sc.tl.leiden(adata, key_added="leiden_res_0.30", resolution=0.30, flavor="igraph")

In [None]:
sc.pl.umap(
    adata,
    color=["leiden_res_0.30"],
    legend_loc="on data",
)

In [None]:
# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden_res_0.30", method="wilcoxon")
sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_res_0.30", standard_scale="var", n_genes=5
)

# Cycle scoring

Load the cell cycle genes defined in Tirosh et al., 2015. It is a list of 97 genes, represented by their gene symbols. The list here is for human; for other organisms, a list of orthologues should be compiled. There are major differences in the way Scanpy and Seurat manage data; in particular, we need to filter out cell cycle genes that are not present in our dataset to avoid errors.

In [None]:
# Read one gene symbol per line and close the file afterwards
with open('./regev_lab_cell_cycle_genes.txt') as handle:
    cell_cycle_genes = [line.strip() for line in handle]

Here we define two lists, genes associated to the S phase and genes associated to the G2M phase

In [None]:
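s_genes = cell_cycle_genes[:43]
g2m_genes = cell_cycle_genes[43:]
# Keep only the genes that are present in the dataset
s_genes = [x for x in s_genes if x in adata.var_names]
g2m_genes = [x for x in g2m_genes if x in adata.var_names]
cell_cycle_genes = [x for x in cell_cycle_genes if x in adata.var_names]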

Standard filters applied. Note that we do not extract variable genes and work on the whole dataset, instead. This is because, for this demo, almost 70 cell cycle genes would not be scored as variable. Cell cycle scoring on ~20 genes is ineffective.

We here perform cell cycle scoring. The function is actually a wrapper around sc.tl.score_genes, which is called twice to score the S and G2M phases separately. Both sc.tl.score_genes and sc.tl.score_genes_cell_cycle are ports from Seurat and are supposed to work in a very similar way. To score a gene list, the algorithm calculates the difference between the mean expression of the given list and the mean expression of a set of reference genes. To build the reference, the function randomly samples genes matching the expression distribution of the given list. Cell cycle scoring adds three columns to adata.obs: a score for S phase (S_score), a score for G2M phase (G2M_score) and the predicted cell cycle phase (phase).
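The scoring procedure described above can be illustrated with a simplified NumPy sketch. It is not scanpy's implementation: the function `score_gene_set`, its default parameters and the dense toy matrix are assumptions, and the binning is a simplified form of the expression-matched reference sampling.

```python
import numpy as np

def score_gene_set(X, gene_idx, n_bins=5, ctrl_size=2, seed=0):
    """Mean expression of a gene set minus the mean of expression-matched control genes.

    X is a dense cells x genes matrix; gene_idx holds column indices of the gene set.
    """
    rng = np.random.default_rng(seed)
    gene_idx = np.asarray(gene_idx)
    gene_means = X.mean(axis=0)
    # Rank genes by mean expression and cut the ranks into equally sized bins.
    ranks = gene_means.argsort().argsort()
    bins = ranks * n_bins // X.shape[1]
    control = set()
    for b in np.unique(bins[gene_idx]):
        # Candidate controls: genes in the same bin that are not in the gene set.
        candidates = np.setdiff1d(np.flatnonzero(bins == b), gene_idx)
        take = min(ctrl_size, candidates.size)
        if take:
            control.update(rng.choice(candidates, size=take, replace=False).tolist())
    control = np.array(sorted(control), dtype=int)
    return X[:, gene_idx].mean(axis=1) - X[:, control].mean(axis=1)

# Toy data: 20 cells x 30 genes; the first 10 cells over-express genes 0-2.
toy_rng = np.random.default_rng(1)
X = toy_rng.random((20, 30))
gene_set = [0, 1, 2]
X[:10, gene_set] += 3.0

scores = score_gene_set(X, gene_set)
print(scores.round(2))
```

Cells that over-express the gene set receive clearly positive scores, while the others stay near zero; the cell cycle wrapper applies this kind of score once for the S genes and once for the G2M genes.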

In [None]:
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)

In [None]:
sc.pl.umap(
    adata,
    color=["phase","leiden_res_0.30","sample"],
)

The clusters green/blue (0 and 2) are mainly G2M and S cell cycle. In day1 we have more cells in those clusters. We can see this with a violin plot

In [None]:
sc.pl.violin(
    adata,["CD44","CD24","EGFR", "S_score","G2M_score"],groupby="sample",jitter=True, scale='width', log=False, rotation=45, stripplot=True, multi_panel=True
)

We also see that CD44 and CD24 are highest at day0, decrease already by day1, and then remain rather stable through day2 and day3.
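The phase composition noted above can also be quantified as a table of fractions per cluster (or per sample). This is a small pandas sketch; the toy table stands in for `adata.obs`, and its values are invented.

```python
import pandas as pd

# Toy stand-in for adata.obs with cluster labels and predicted phases.
obs = pd.DataFrame({
    "leiden_res_0.30": ["0", "0", "0", "1", "1", "2", "2", "2"],
    "phase": ["G2M", "G2M", "S", "G1", "G1", "S", "S", "G1"],
})

# Fraction of cells in each phase within every cluster (rows sum to 1).
phase_fraction = pd.crosstab(obs["leiden_res_0.30"], obs["phase"], normalize="index")
print(phase_fraction.round(2))
```

Replacing the cluster column with `sample` gives the same breakdown per time point.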

# Signatures from the RNAseq bulk of P1-4 populations

I calculate the signature scores from the CSV file listing the genes of the four populations. There are two different versions of the population 1 gene list.

In [None]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
df = pd.read_csv('Pop_GeneLists2.csv')

# Group the genes based on the signature they belong to
# Assuming the signature names are unique in the 'List' column
signatures = df.groupby('List')['Name'].apply(list).to_dict()

# You will now have a dictionary where the keys are the signature names
# and the values are the corresponding lists of genes


Now the scores
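Before scoring, it can be useful to check how many genes of each signature are actually present in the dataset, since a signature with few detected genes gives an unreliable score. The helper `split_by_presence` and the toy gene universe are assumptions; with the real data, `var_names` would be `adata.var_names`.

```python
def split_by_presence(signatures, var_names):
    """Split every signature into genes found in `var_names` and genes that are missing."""
    available = set(var_names)
    present = {name: [g for g in genes if g in available]
               for name, genes in signatures.items()}
    missing = {name: [g for g in genes if g not in available]
               for name, genes in signatures.items()}
    return present, missing

# Toy example reusing marker symbols from this notebook; the gene universe is made up.
toy_signatures = {
    "Pop1": ["CRYAB", "RGS6", "PHLDA2"],
    "Pop4": ["SLC38A1", "SLC40A1", "RARRES2"],
}
toy_var_names = ["CRYAB", "PHLDA2", "SLC38A1", "SLC40A1", "RARRES2", "MALAT1"]

present, missing = split_by_presence(toy_signatures, toy_var_names)
for name in toy_signatures:
    print(name, len(present[name]), "present;", "missing:", missing[name])
```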

In [None]:
# Calculate scores for each signature
for signature_name, gene_list in signatures.items():
    # Calculate the score for each gene signature
    sc.tl.score_genes(adata, gene_list, score_name=signature_name + '_score')

# Now the scores will be stored in adata.obs with column names like 'Signature1_score', 'Signature2_score', etc.


In [None]:
# Visualize the signature scores on UMAP
sc.pl.umap(adata, color=[signature_name + '_score' for signature_name in signatures.keys()],size=100, cmap='RdYlBu_r')


It seems that Pop2 are the basal level of clusters in the left (cluster2 mainly), population 1 is in between those clusters and population 3 is more to the left of cluster2. Population 4 is clearly cluster 3. Population 2 could also be the cluster up there a bit unconnected, though.

In [None]:
import matplotlib.pyplot as plt
# List of signature score columns to plot
signature_scores = [signature_name + '_score' for signature_name in signatures.keys()]

# Plot violin plots without showing the points
sc.pl.violin(adata, signature_scores, groupby='sample', jitter=False, scale='width', rotation=45, stripplot=False)


Not very clear; however, Population 2 (the original one) is still the majority. Pop3 increases over the days, while Pop1 seems to increase only at day1. Pop4 has few cells.

I increase the resolution to 0.7 to increase granularity

In [None]:
sc.tl.leiden(adata, key_added="leiden_res_0.70", resolution=0.70, flavor="igraph")
sc.pl.umap(
    adata,
    color=["leiden_res_0.70"],
    legend_loc="on data",
)

I decide to keep this resolution as definitive and then I name them 'clusters'

In [None]:
import scanpy as sc

# Assuming 'adata' is your AnnData object
adata.obs['clusters'] = adata.obs['leiden_res_0.70']


# Barplot of percentages of clusters

In [None]:
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt

# Assuming your adata object is named 'adata'
# Extract cluster and sample information
clusters = adata.obs['clusters']
samples = adata.obs['sample']

# Create a DataFrame with the cluster and sample information
df = pd.DataFrame({'cluster': clusters, 'sample': samples})

# Calculate the percentages of cells in each cluster for each sample
percentage_df = df.groupby(['sample', 'cluster']).size().unstack(fill_value=0)
percentage_df = percentage_df.div(percentage_df.sum(axis=1), axis=0) * 100

# Get the colors from the UMAP plot
umap_colors = adata.uns['leiden_res_0.70_colors']

# Plot the barplot
percentage_df.plot(kind='bar', stacked=True, color=umap_colors)
plt.xlabel('Sample')
plt.ylabel('Percentage of Cells')
plt.title('Percentage of Cells in Each Leiden Cluster for Each Sample')
plt.legend(title='Leiden Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
sc.pl.umap(adata,color="clusters", legend_loc='on data')

Cluster 8 (light blue) increases after day1, it seems to be cell cycle replication. The same for cluster 5. Cluster 6 (pink) decreases, could be the G1 of Pop2 that gets into cell cycle. Cluster 7 shrinks a bit but it is stable. Same for cluster 4 which seems in the middle. Also cluster 3 decreases at day 1 and then is stable. Cluster 2, which is a bit apart, same.

I obtain the markers for those clusters

In [None]:
# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden_res_0.70", method="wilcoxon")
sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_res_0.70", standard_scale="var", n_genes=5
)

In [None]:
# Write an excel file with the genes
import pandas as pd

# Extract the results from the rank_genes_groups function
result = adata.uns['rank_genes_groups']

# Create a DataFrame to store the results
groups = result['names'].dtype.names
df = pd.DataFrame(
    {group + '_' + key: result[key][group]
     for group in groups for key in result.keys() if key != 'params'})

# Save the DataFrame to an Excel file
df.to_excel('markers_statistics.xlsx', index=False)

print("The markers and their statistics have been successfully saved to markers_statistics.xlsx")


Cluster 1 is cell cycle
Cluster 5 is cell cycle G2/M mainly, too many histones
Cluster 8 is S to G2 (RRM2 = ribonucleotide reductase regulatory subunit + HIST1H1B)
Cluster 9 (apart) is interesting, MALAT1 and KCNQ1OT1 are two lncRNAs. I've seen KCNQ1OT1 associated to low qc thus I'm not sure. We might have to remove this cluster, which is also rather unconnected. 
Cluster 7 is surely a different state
Cluster 7 is also a state
Cluster 6 again KCNQ1OT1 but less and others... too big to be low qc.
Cluster 0 regulating cell cycle progression? CDKN3 regulates (stops) G1 to S
Cluster 3 Ribosomal?
Cluster 4 Very interesting markers

In [None]:
adata.write("Time_course_uncorrected_ccc_preprocessed_all_clusters_in.h5ad")

# Module score

I try now to calculate the module score of the different modules in glioblastoma

In [None]:
import pandas as pd

# Load the tab-delimited file into a pandas DataFrame
file_path = 'metamodules.txt'  # Replace with the correct file path
modules_df = pd.read_csv(file_path, delimiter='\t', index_col=0)

# Convert the DataFrame into a dictionary of lists
# Each column represents a module, and each row represents genes
modules_dict = {col: modules_df[col].dropna().tolist() for col in modules_df.columns}

# Check the module dictionary (optional)
print(modules_dict)


In [None]:
import scanpy as sc

# Calculate the scores for each module
for module_name, gene_list in modules_dict.items():
    sc.tl.score_genes(adata, gene_list, score_name=module_name)

# Check the adata.obs to see the added module scores (optional)
print(adata.obs.head())


In [None]:
import scanpy as sc
import matplotlib.pyplot as plt

# List of module names (replace with actual names from your data)
module_names = list(modules_dict.keys())

# Create subplots: 1 row, N columns (side-by-side)
fig, axes = plt.subplots(1, len(module_names), figsize=(5 * len(module_names), 5))

# Plot each UMAP in a separate subplot
for i, module_name in enumerate(module_names):
    sc.pl.umap(adata, color=module_name, cmap='RdYlBu_r', size=40, ax=axes[i], show=False)
    axes[i].set_title(f'UMAP: {module_name}')

# Show the plots
plt.tight_layout()
plt.show()


Cluster 9 nothing = I'll remove it
Cluster 4 AC mainly
Cluster 6 OPC/NPC
Cluster 7 NPC getting transformed (Cluster 6 seems NPC1)
Cluster 0 to Cluster 1 NPC/AC cells going into cell cycle G2M =>
Cluster 1 G2/M
Cluster 8 OPC going to G1/S =>
Cluster 5
Cluster 2 NPC1
Cluster 3 NPC1?

In [None]:
sc.pl.umap(adata,color="clusters", legend_loc='on data')

NPC1 and NPC2 are similar, but NPC2 increases towards the left of the cluster. G1/S and G2/M are clearly to the right in this uncorrected dataset.
MES1 is in the middle but more or less indistinguishable from AC and OPC. Could it be that the cells in the middle are the plastic ones?

In [None]:
import scanpy as sc
import matplotlib.pyplot as plt

# List of module names (replace with actual names from your data)
module_names = list(modules_dict.keys())

# Create subplots: 1 row, N columns (side-by-side)
fig, axes = plt.subplots(1, len(module_names), figsize=(5 * len(module_names), 5))

# Plot each violin in a separate subplot, grouped by 'sample'
for i, module_name in enumerate(module_names):
    sc.pl.violin(adata, keys=module_name, groupby='sample', ax=axes[i], show=False, jitter=False, stripplot=False)
    axes[i].set_title(f'Violin: {module_name}')

# Show the plots
plt.tight_layout()
plt.show()


MES1 seems to decrease globally with time, while AC, OPC and NPC1 increase. The cell cycle is stimulated at day1.

In [None]:
import scanpy as sc
import matplotlib.pyplot as plt

# Define the list of samples and modules (replace with actual names)
samples = adata.obs['sample'].unique()  # Assuming 'sample' column holds sample names
module_names = list(modules_dict.keys())

# Create a grid of subplots: 4 rows (for 4 samples), len(module_names) columns
fig, axes = plt.subplots(len(samples), len(module_names), figsize=(5 * len(module_names), 5 * len(samples)))

# Loop over each sample and each module to create UMAP plots
for i, sample in enumerate(samples):
    # Subset the data for the current sample
    sample_adata = adata[adata.obs['sample'] == sample]
    
    for j, module_name in enumerate(module_names):
        # Create a UMAP plot for the current sample and module
        sc.pl.umap(sample_adata, color=module_name, ax=axes[i, j], cmap='RdYlBu_r', size=40, show=False)
        axes[i, j].set_title(f'Sample: {sample}, Module: {module_name}')

# Adjust layout and display
plt.tight_layout()
plt.show()


I have to rename the clusters now
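Renaming can be done by mapping cluster ids to labels. In the sketch below the labels are provisional placeholders taken from the notes above (for example "Cluster 1 G2/M" and "Cluster 4 AC mainly"), not a final annotation, and the toy series stands in for `adata.obs['clusters']`.

```python
import pandas as pd

# Provisional labels (assumption: placeholders, to be revised after annotation).
cluster_names = {
    "1": "G2/M",
    "4": "AC-like",
}

# Toy stand-in for adata.obs['clusters'].
clusters = pd.Series(["1", "4", "9", "1"], dtype="category")

# Map known ids to names; ids without a label keep a generic name.
ids = clusters.astype(str)
annotated = ids.map(cluster_names).fillna("cluster_" + ids).astype("category")
print(annotated.tolist())
```

Keeping the result categorical lets scanpy plotting functions treat it like any other cluster column.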

# Manual annotation (aided by CoPilot)
I gave Copilot the information about the markers (score > 50, log2FC > 0.5 and FDR < 0.05) and the levels of the glioblastoma modules, and obtained a proposal for naming each cluster, with some supporting genes based on the literature.
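The marker thresholds quoted above can be applied with a small pandas filter. The helper `filter_markers` and the toy table are assumptions; the column names follow the long-format table returned by `sc.get.rank_genes_groups_df`, and the gene names are placeholders.

```python
import pandas as pd

def filter_markers(markers, min_score=50.0, min_logfc=0.5, max_fdr=0.05):
    """Keep markers passing the score, log2 fold-change and adjusted p-value cutoffs."""
    keep = (
        (markers["scores"] > min_score)
        & (markers["logfoldchanges"] > min_logfc)
        & (markers["pvals_adj"] < max_fdr)
    )
    return markers.loc[keep].sort_values("scores", ascending=False)

# Toy marker table (hypothetical values). With the real data it could come from
# sc.get.rank_genes_groups_df(adata, group=None).
markers = pd.DataFrame({
    "names": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "scores": [80.0, 60.0, 55.0, 10.0],
    "logfoldchanges": [2.0, 0.2, 1.0, 3.0],
    "pvals_adj": [1e-10, 1e-8, 0.2, 1e-12],
})

selected = filter_markers(markers)
print(selected)
```

The per-cluster notes below use lower score cutoffs (10 or 20), which can be passed through `min_score`.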

## Cluster 0
Score > 10; FDR < 0.05; log2FC > 0.5

Based on the list of genes provided, several of them are associated with cell cycle regulation, proliferation, and glioblastoma biology. Here are some key genes and their roles that support the suggested cluster names:

**PTTG1** (Pituitary Tumor-Transforming Gene 1): This gene is involved in the regulation of mitosis and is often overexpressed in various tumors, including glioblastoma. It supports the idea of a Progenitor-Enriched Cluster due to its role in cell proliferation.

**CDKN3** (Cyclin-Dependent Kinase Inhibitor 3): CDKN3 acts as a negative regulator of the cell cycle, which is crucial for maintaining the progenitor state. This aligns with the Progenitor Signature Cluster.

**BIRC5** (Survivin): This gene inhibits apoptosis and promotes cell proliferation, which is common in progenitor cells and cancer. It supports the Glioblastoma Progenitor Cluster.

**CDC20** (Cell Division Cycle 20): CDC20 is essential for the progression of the cell cycle and is often upregulated in cancer. This gene supports the NPC/OPC Hybrid Cluster due to its role in cell division.

**VIM** (Vimentin): Vimentin is a marker of mesenchymal cells and is often upregulated in glioblastoma, indicating a more aggressive and progenitor-like state. This supports the Neuro-Oligo Progenitor Cluster.

These genes highlight the proliferative and progenitor-like characteristics of the cluster, making the suggested names biologically relevant.

## Cluster 1
Score > 20; FDR < 0.05; log2FC > 0.5

Given the G2/M glioblastoma signature of your cluster, you could name it “G2/M Proliferative Cluster”. This name reflects the phase of the cell cycle that is highly active in glioblastoma cells.

Here are five genes from your list that support this G2/M signature:

**CCNB1** (Cyclin B1): Essential for the control of the cell cycle at the G2/M (mitosis) transition.

**AURKA** (Aurora Kinase A): Plays a crucial role in cell division by controlling chromosomal segregation.

**PLK1** (Polo-Like Kinase 1): Involved in various stages of mitosis, including spindle formation and cytokinesis.

**CDC20** (Cell Division Cycle 20): Activates the anaphase-promoting complex/cyclosome (APC/C), which is necessary for the transition from metaphase to anaphase.

**MKI67** (Marker of Proliferation Ki-67): A well-known marker for cell proliferation, present during all active phases of the cell cycle (G1, S, G2, and mitosis).

These genes are highly relevant to the G2/M phase and are commonly associated with the proliferative nature of glioblastoma cells.

## Cluster 2
Score > 10; FDR < 0.05; log2FC > 0.5

Given that Cluster 2 expresses the NPC_1 signature and includes the listed genes, you might consider naming it something that reflects its neural progenitor cell (NPC) characteristics and its commitment to this state. Here are a few suggestions:
Committed NPC Cluster: This name highlights the cluster’s strong commitment to the neural progenitor cell state.
Neural Progenitor Signature Cluster: Emphasizing the high expression of NPC-related genes.
NPC-Enriched Cluster: Reflecting the enrichment of NPC markers.
Neurogenic Progenitor Cluster: Indicating the neurogenic potential of the cells in this cluster.

Here are some key genes from your list that support these names:

**SLC38A1** (Solute Carrier Family 38 Member 1): Involved in amino acid transport, which is crucial for cell growth and differentiation.

**H1F0** (H1 Histone Family Member 0): Associated with chromatin structure and gene regulation, important for progenitor cell function.

**RARRES2** (Retinoic Acid Receptor Responder 2): Plays a role in cell differentiation and proliferation.

**IGFBP5** (Insulin-Like Growth Factor Binding Protein 5): Involved in cell growth and survival, often expressed in progenitor cells.

**SOX4** (SRY-Box Transcription Factor 4): A transcription factor important for neural progenitor cell development and differentiation.

These genes highlight the neural progenitor characteristics of the cluster, making the suggested names biologically relevant.

## Cluster 3

Here are five non-ribosomal genes from your list that support this name:

**SERPINF1** (Serpin Family F Member 1): Known for its role in inhibiting angiogenesis and is often involved in tumor progression.

**GAS5** (Growth Arrest Specific 5): A non-coding RNA that regulates cell growth and apoptosis, often implicated in cancer.

**NOP53** (Nucleolar Protein 53): Involved in ribosome biogenesis and cell cycle regulation, playing a role in tumorigenesis.

**TSPO** (Translocator Protein): Associated with mitochondrial function and often upregulated in glioblastoma.

**RACK1** (Receptor for Activated C Kinase 1): Involved in various signaling pathways, including those regulating cell growth and survival.

## Cluster 4 

score > 10; FDR < 0.05; log2FC > 0.5
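
The thresholds above can be applied to a marker table programmatically. A minimal sketch, assuming a table shaped like the output of `sc.get.rank_genes_groups_df` (columns `names`, `scores`, `logfoldchanges`, `pvals_adj`); the helper `filter_markers` is hypothetical and the rows are made-up toy values, not results from this dataset.

```python
import pandas as pd

# Toy marker table with the column names used by sc.get.rank_genes_groups_df.
# The gene names and numbers are invented for illustration only.
markers = pd.DataFrame(
    {
        "names": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "scores": [25.0, 12.0, 8.0, 15.0],
        "logfoldchanges": [1.2, 0.7, 2.0, 0.3],
        "pvals_adj": [1e-10, 1e-4, 1e-6, 0.2],
    }
)


def filter_markers(df, min_score=10.0, max_fdr=0.05, min_log2fc=0.5):
    """Keep rows passing the score, FDR and log2 fold-change cutoffs."""
    keep = (
        (df["scores"] > min_score)
        & (df["pvals_adj"] < max_fdr)
        & (df["logfoldchanges"] > min_log2fc)
    )
    return df.loc[keep].reset_index(drop=True)


selected = filter_markers(markers)
print(selected["names"].tolist())
```

Passing different cutoffs (for example `min_score=20, min_log2fc=1`) reproduces the stricter settings noted for other clusters.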

Given the diverse signatures (OPC, NPC, MES1, and AC) and the central location in the UMAP, this cluster could be indicative of a highly plastic and multipotent cell population. You might consider naming it “Multipotent Progenitor Cluster” to reflect its diverse potential and central role in pseudotime trajectories.

Here are five genes from your list that support this multipotent nature:

**HES5** (Hes Family BHLH Transcription Factor 5): Involved in the Notch signaling pathway, crucial for maintaining progenitor cell states.

**SPARCL1** (Secreted Protein Acidic and Rich in Cysteine-Like 1): Plays a role in cell-matrix interactions and is often upregulated in glioblastoma.

**FABP7** (Fatty Acid Binding Protein 7): Linked to neural stem cells and glioblastoma stem-like cells.

**IGFBP2** (Insulin-Like Growth Factor Binding Protein 2): Associated with glioblastoma progression and progenitor cell characteristics.

**VIM** (Vimentin): A marker for mesenchymal cells, indicating a mesenchymal-like state.

These genes highlight the cluster’s potential for differentiation into various cell types, which is characteristic of progenitor cells.

## Cluster 5

Here are five genes from your list that support this G1/S signature:

**TOP2A** (Topoisomerase II Alpha): Essential for DNA replication and chromosome segregation; its expression rises in S phase and peaks in G2/M.

**CDK1** (Cyclin-Dependent Kinase 1): Plays a crucial role in the control of the cell cycle, particularly in the transition from G2 to M phase.

**MKI67** (Marker of Proliferation Ki-67): A well-known marker for cell proliferation, present during all active phases of the cell cycle.

**AURKB** (Aurora Kinase B): Involved in chromosome segregation and cytokinesis, crucial for cell division.

**RRM2** (Ribonucleotide Reductase Regulatory Subunit M2): Plays a key role in DNA synthesis and repair, particularly during the S phase.

These genes are indicative of the G1/S phase and are commonly associated with the proliferative nature of glioblastoma cells.

## Cluster 6

score > 20; FDR < 0.05

Given the high expression of OPC (oligodendrocyte progenitor cells) and NPC1 (neural progenitor cells) signatures, and its proximity to cluster 3 in the UMAP, you could name this cluster “Progenitor Cell Cluster 2” to reflect its progenitor cell characteristics.

Here are five non-ribosomal genes from your list that support this name:

**SOX11** (SRY-Box Transcription Factor 11): Involved in neural development and often expressed in progenitor cells.

**DCX** (Doublecortin): A marker for neural progenitor cells and involved in neuronal migration.

**PTPRZ1** (Protein Tyrosine Phosphatase Receptor Type Z1): Associated with neural stem cells and glioblastoma.

**CCND2** (Cyclin D2): Plays a role in cell cycle regulation, particularly in progenitor cells.

**SOX4** (SRY-Box Transcription Factor 4): Involved in the regulation of embryonic development and cell fate decisions.

These genes highlight the cluster’s potential for differentiation into various cell types, which is characteristic of progenitor cells.

## Cluster 7

score > 10 ; FDR < 0.05; log2FC > 0.5

Given the high expression of the NPC_2 (neural progenitor cell) signature and its location at the extreme end of the UMAP, suggesting a more committed state, you could name this cluster “Committed Neural Progenitor Cluster”.

Here are five genes from your list that support this name:

**DLX2** (Distal-Less Homeobox 2): Involved in neural development and differentiation.

**DCX** (Doublecortin): A marker for neural progenitor cells and involved in neuronal migration.

**SOX4** (SRY-Box Transcription Factor 4): Plays a role in the regulation of embryonic development and cell fate decisions.

**INSM1** (Insulinoma-Associated 1): A transcription factor involved in neurogenesis.

**PBX1** (Pre-B-Cell Leukemia Homeobox 1): Involved in the regulation of developmental processes, including neural development.

These genes highlight the cluster’s commitment to neural progenitor cell states and its potential role in glioblastoma biology.

## Cluster 8

score > 20; FDR < 0.05; log2FC > 1

Given the high expression of genes related to the G1/S phase of the cell cycle and DNA replication, you could name this cluster “G1/S Replicative Cluster”. This name reflects the active DNA replication and cell cycle progression characteristic of these cells.

Here are five genes from your list that support this G1/S signature:

**RRM2** (Ribonucleotide Reductase Regulatory Subunit M2): Plays a key role in DNA synthesis and repair, particularly during the S phase.

**TK1** (Thymidine Kinase 1): Involved in DNA synthesis and is a marker for cell proliferation.

**MCM10** (Minichromosome Maintenance Complex Component 10): Essential for the initiation of DNA replication.

**RAD51** (RAD51 Recombinase): Involved in homologous recombination and DNA repair.

**PCNA** (Proliferating Cell Nuclear Antigen): Acts as a processivity factor for DNA polymerase in DNA replication.

These genes highlight the cluster’s involvement in DNA replication and cell cycle progression.

## Cluster 9

This cluster shows fewer detected genes and many lncRNAs, such as MALAT1, NEAT1, and KCNQ1OT1.


In [None]:
sc.pl.umap(adata, color=["clusters", "log1p_total_counts", "pct_counts_mt", "log1p_n_genes_by_counts", "predicted_doublet"])

In [None]:
import scanpy as sc

# `adata` is the AnnData object clustered above; the labels are in adata.obs['clusters']
# Define a dictionary with the old cluster names as keys and new names as values
new_cluster_names = {
    '0': 'Neural_Progenitors',
    '1': 'G2/M Proliferative',
    '2': 'Committed NPC_1',
    '3': 'NeuroOligo_Progenitors',
    '4': 'Multipotent Progenitors',
    '5': 'G1/S Proliferative',
    '6': 'NeuroOligo_Progenitors_2',
    '7': 'Commited NPC_2',
    '8': 'G1/S Replicative',
    '9': 'lowQC'
}

# Map the new names onto the 'clusters' column in the obs DataFrame
adata.obs['clusters_renamed'] = adata.obs['clusters'].map(new_cluster_names)

# Plot the UMAP with the new cluster names
sc.pl.umap(adata, color='clusters_renamed')
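
`Series.map` silently returns `NaN` for any label missing from the dictionary, so it can be worth checking the result before plotting. A small sketch on a toy `Series`; the labels and names here are placeholders, not the real cluster column.

```python
import pandas as pd

# Toy stand-in for adata.obs["clusters"]; values are illustrative only.
clusters = pd.Series(["0", "1", "2", "1", "10"])
names = {"0": "A", "1": "B", "2": "C"}  # "10" is deliberately missing

renamed = clusters.map(names)

# Labels that were not covered by the dictionary end up as NaN.
unmapped = sorted(set(clusters[renamed.isna()]))
print(unmapped)
```

An empty `unmapped` list means every cluster received a new name.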


In [None]:
# Adjust the font size of the legend
sc.pl.umap(adata, color='clusters_renamed', legend_fontsize=8)

# CellRank

In [None]:
#pip install git+https://github.com/theislab/moscot.git@main


In [None]:
#pip install --upgrade cellrank

In [None]:
#pip install --upgrade scvelo

I updated line 53 of the file "~/.conda/envs/cellrank2.0/lib/python3.11/site-packages/chex/_src/pytypes.py" with:
from typing import Sequence, Any
Shape = Sequence[int | Any]

In [None]:
# Downgrade Jax
#!pip install --user "jax[cuda12_pip]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

In [None]:
#!pip install scanpy==1.9.3
import scanpy as sc
import sys
import cellrank as cr


sc.settings.set_figure_params(frameon=False, dpi=100)
cr.settings.verbosity = 2
if "google.colab" in sys.modules:
    !pip install -q git+https://github.com/theislab/cellrank

In [None]:
adata.write("Time_course_uncorrected_ccc_preprocessed_all_clusters_in.h5ad")
# adata = sc.read("time_course.h5ad")


Now I filter out cluster 9 (lowQC), which is disconnected from the other clusters, before computing pseudotime.

In [None]:
import scanpy as sc

# `adata` carries the renamed clusters in .obs['clusters_renamed']
# Keep only the cells that are NOT in cluster 9 ('lowQC'); copy so later writes do not hit a view
adata_filtered = adata[adata.obs['clusters_renamed'] != 'lowQC', :].copy()

# Verify the removal
print(adata_filtered.obs['clusters_renamed'].unique())


In [None]:
sc.tl.diffmap(adata_filtered)

In [None]:
import scvelo as scv
root_ixs = 2394  # has been found using `adata.obsm['X_diffmap'][:, 3].argmax()`
scv.pl.scatter(
    adata_filtered,
    basis="diffmap",
    c=["clusters_renamed", root_ixs],
    legend_loc="right",
    components=["1, 2"],
)

adata_filtered.uns["iroot"] = root_ixs
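
The root index in the cell above was picked as the extreme cell along one diffusion component. A toy sketch of that selection with NumPy; the random matrix stands in for `adata_filtered.obsm['X_diffmap']`, and component 3 is taken from the comment above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for adata_filtered.obsm["X_diffmap"]: 100 cells x 5 diffusion components.
X_diffmap = rng.normal(size=(100, 5))

component = 3  # component mentioned in the comment above
root_ix = int(X_diffmap[:, component].argmax())

# The chosen cell has the largest value along that component.
print(root_ix, X_diffmap[root_ix, component])
```

Because `iroot` is a positional index, it should be computed on the same object that is passed to `sc.tl.dpt` (here `adata_filtered`); an index found on the unfiltered `adata` can point at a different cell.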

In [None]:
sc.tl.dpt(adata_filtered)
sc.pl.embedding(
    adata_filtered,
    basis="umap",
    color=["dpt_pseudotime"],
    color_map="gnuplot2",
)

In [None]:
#!pip install moscot --user
import moscot
from moscot.problems.time import TemporalProblem

import cellrank as cr
import scanpy as sc
from cellrank.kernels import RealTimeKernel

sc.settings.set_figure_params(frameon=False, dpi=100)
cr.settings.verbosity = 2

In [None]:
import warnings

warnings.simplefilter("ignore", category=UserWarning)

In [None]:
# Create a mapping dictionary
mapping = {"day0": 1, "day1": 2, "day2": 3, "day3": 4}

# Apply the mapping to the 'sample' column
adata_filtered.obs["day_numerical"] = adata_filtered.obs["sample"].map(mapping)


In [None]:
# Compute the force-directed layout.
# The cells below read .obsm['X_draw_graph_fr'], so use the Fruchterman-Reingold
# layout ('fr'); layout='fa' (ForceAtlas2) stores its result under 'X_draw_graph_fa'.
sc.tl.draw_graph(adata_filtered, layout='fr')

# Plot the result, colored by sample
sc.pl.draw_graph(adata_filtered, color='sample')

In [None]:
sc.pl.embedding(
    adata_filtered,
    basis="X_draw_graph_fr",
    color=["day_numerical", "clusters_renamed"],
    color_map="gnuplot",
)

# Moscot
With moscot, we couple cells across time points using optimal transport (OT), as pioneered by Waddington-OT [Schiebinger et al., 2019]. moscot scales to millions of cells and supports multi-modal data [Klein et al., 2023]. We demonstrate the most basic use-case here: linking a smaller-scale unimodal scRNA-seq dataset across experimental time points.

## Note
moscot can do much more! To learn how to incorporate multimodal information, millions of cells, and additional spatial information, check out the documentation, including many tutorials. Additionally, to include lineage-traced data, check out the moscot-lineage (moslin) tutorial.

Importantly, everything we demonstrate here works exactly the same if you include these additional data modalities! The couplings just get better, and additional downstream analysis becomes available.

The first step is to set up a TemporalProblem. If you have additional spatial or lineage information, you can use the SpatioTemporalProblem or the LineageProblem, respectively.

In [None]:
tp = TemporalProblem(adata_filtered)

Next, we adjust the marginals for cellular growth and death using score_genes_for_marginals().

In [None]:
tp = tp.score_genes_for_marginals(
    gene_set_proliferation="human", gene_set_apoptosis="human"
)

Visualize the computed proliferation and apoptosis scores in the embedding.

In [None]:
sc.pl.embedding(
    adata_filtered, basis="X_draw_graph_fr", color=["clusters_renamed", "proliferation", "apoptosis"]
)

Following the original Waddington OT publication, we use local PCAs, computed separately for each pair of time points, to calculate distances among cells [Schiebinger et al., 2019]. Accordingly, we prepare the TemporalProblem without passing a joint_attr; this automatically computes local PCAs.
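
As a rough illustration of a "local PCA" cost, the sketch below projects the cells of one pair of time points onto principal components fitted on just those cells, then takes squared Euclidean distances between early and late cells. This is a NumPy toy under those assumptions, not moscot's internal code; the helper name `local_pca_cost`, the array sizes and the number of components are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
early = rng.normal(size=(30, 50))  # 30 early cells x 50 genes (toy data)
late = rng.normal(size=(40, 50))   # 40 late cells x 50 genes (toy data)


def local_pca_cost(x, y, n_comps=10):
    """Squared Euclidean cost between x and y in a PCA space fitted on both."""
    joint = np.vstack([x, y])
    centered = joint - joint.mean(axis=0)
    # Rows of vt are the principal axes of this pair of time points.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    z = centered @ vt[:n_comps].T
    zx, zy = z[: len(x)], z[len(x):]
    # Pairwise squared distances via broadcasting.
    diff = zx[:, None, :] - zy[None, :, :]
    return (diff ** 2).sum(axis=-1)


cost = local_pca_cost(early, late)
print(cost.shape)
```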

In [None]:
tp = tp.prepare(time_key="day_numerical")

In the final step, we solve one OT problem per time point pair, probabilistically matching early to late cells [Peyré et al., 2019].

In [None]:
# I stop here and continue tomorrow
adata_filtered.write("time_course_cluster_9_out.h5ad")

In [None]:
# I save the tp object with pickle
import pickle

# Save the object to a file
with open('tp.pkl', 'wb') as f:
    pickle.dump(tp, f)




In [None]:
# I start over again today
# Load libraries
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import scvelo as scv
import sys
import cellrank as cr

sc.settings.verbosity = 3
sc.logging.print_header()
sc.settings.set_figure_params(dpi=100, facecolor='white', frameon=False)
cr.settings.verbosity = 2

if "google.colab" in sys.modules:
    !pip install -q git+https://github.com/theislab/cellrank

# Load adata_filtered
adata_filtered = sc.read("time_course_cluster_9_out.h5ad")
# Run to load tp
import pickle

# Load the object from the file
with open('tp.pkl', 'rb') as f:
    tp = pickle.load(f)

print(tp)


In [None]:
#!pip install --upgrade jax jaxlib
#!pip install "jax[cuda12_pip]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html


In [None]:
tp = tp.solve(epsilon=1e-3, tau_a=0.95, scale_cost="mean",)


Above, epsilon and tau_a control the amount of entropic regularization and unbalancedness on the source marginal, respectively. Higher entropic regularization speeds up the optimization and improves statistical properties of the solution [Cuturi, 2013]; unbalancedness makes the solution more robust with respect to uncertain cellular growth rates and biased cell sampling [Chizat et al., 2018, Schiebinger et al., 2019].
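
To make the role of `epsilon` concrete, here is a minimal balanced Sinkhorn iteration in NumPy. It is a textbook sketch and not moscot's solver: it ignores unbalancedness (`tau_a`), uses toy uniform marginals and a larger `epsilon` than the cell above, and the function name `sinkhorn` is my own.

```python
import numpy as np


def sinkhorn(cost, a, b, epsilon=0.5, n_iter=1000):
    """Entropy-regularized OT coupling between marginals a and b."""
    K = np.exp(-cost / epsilon)  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)    # rescale to match the row marginals
        v = b / (K.T @ u)  # rescale to match the column marginals
    return u[:, None] * K * v[None, :]


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))  # toy early cells
y = rng.normal(size=(7, 2))  # toy late cells
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.mean()    # same idea as scale_cost="mean"

a = np.full(5, 1 / 5)
b = np.full(7, 1 / 7)
coupling = sinkhorn(cost, a, b)
print(coupling.sum(axis=1), coupling.sum(axis=0))
```

Lowering `epsilon` makes the coupling sharper, that is, closer to unregularized OT, but typically needs more iterations to converge.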

# Set up the RealTimeKernel
The RealTimeKernel is CellRank’s interface with time-course data; it can load cellular couplings computed with moscot or Waddington OT.

In [None]:
from moscot.problems.time import TemporalProblem

# import cellrank as cr
# import scanpy as sc
from cellrank.kernels import RealTimeKernel

sc.settings.set_figure_params(frameon=False, dpi=100)
cr.settings.verbosity = 2
tmk = RealTimeKernel.from_moscot(tp)
print(tmk)

To get from a set of OT transport maps to a Markov chain describing a biological system, we do the following:

- we sparsify OT transport maps by removing entries below a certain threshold; entropic regularization yields dense matrices, which would make CellRank analysis very slow.
- we use OT transport maps and molecular similarity to model transitions across and within time points, respectively.
- we row-normalize the resulting cell-cell transition matrix (including all time points) and construct the Markov chain.
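
The three steps above can be sketched on a toy example with two time points. This is a simplified NumPy illustration, not CellRank's implementation: the way the threshold is chosen, the way `conn_weight` blends the blocks, and all matrices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_early, n_late = 4, 5

# Dense toy coupling between early and late cells (entropic OT output is dense).
coupling = rng.random((n_early, n_late))
# Toy within-time-point similarities (e.g. from a kNN graph).
sim_early = rng.random((n_early, n_early))
sim_late = rng.random((n_late, n_late))


def row_normalize(m):
    return m / m.sum(axis=1, keepdims=True)


# 1. Sparsify: drop small entries. The threshold used here is the largest one
#    that still leaves at least one non-zero entry in every row.
threshold = coupling.max(axis=1).min()
sparse_coupling = np.where(coupling >= threshold, coupling, 0.0)

# 2. Combine across-time (coupling) and within-time (similarity) transitions.
conn_weight = 0.2
n = n_early + n_late
T = np.zeros((n, n))
T[:n_early, :n_early] = conn_weight * row_normalize(sim_early)
T[:n_early, n_early:] = (1 - conn_weight) * row_normalize(sparse_coupling)
T[n_early:, n_early:] = row_normalize(sim_late)  # last time point: within-time only

# 3. Row-normalize the full matrix so that every row is a probability distribution
#    (a no-op for this construction, kept to mirror the third step).
T = row_normalize(T)
print(T.sum(axis=1))
```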

In [None]:
tmk.compute_transition_matrix(self_transitions="all", conn_weight=0.2, threshold="auto")

# Visualize the recovered dynamics
We can visualize the cellular dynamics described by this Markov chain by sampling random walks.
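
Sampling a random walk from a row-stochastic matrix only needs repeated draws from the current row. A self-contained sketch with a toy 3-state chain; the matrix and the helper `random_walk` are made up for illustration, whereas `plot_random_walks` operates on the kernel's real transition matrix.

```python
import random

# Toy row-stochastic transition matrix; state 2 is absorbing.
T = [
    [0.5, 0.5, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
]


def random_walk(T, start, max_iter, seed=0):
    """Return the sequence of visited states, starting at `start`."""
    rng = random.Random(seed)
    states = list(range(len(T)))
    path = [start]
    for _ in range(max_iter):
        current = path[-1]
        # Draw the next state from the row of the current state.
        path.append(rng.choices(states, weights=T[current], k=1)[0])
    return path


walk = random_walk(T, start=0, max_iter=50)
print(walk[0], walk[-1])
```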

In [None]:
import sys
import matplotlib.pyplot as plt

tmk.plot_random_walks(
    max_iter=500,
    start_ixs={"day_numerical": 1},
    basis="X_draw_graph_fr",
    seed=0,
    dpi=150,
    size=30,
    save = "tmk_random_walks.pdf"
)

Black and yellow dots denote the starting and finishing points of the random walks, respectively.

We should see black before yellow, but this is not the case...

In [None]:
tmk.plot_random_walks(
    max_iter=500,
    start_ixs={"day_numerical": 1},
    basis="umap",
    seed=0,
    dpi=150,
    size=30,
    save="tmk_random_walks_umap.pdf",  # separate file so the first plot is not overwritten
)

Another way to visualize the reconstructed dynamics is by plotting the probability mass flow in time [Mittnenzweig et al., 2021].

In [None]:
# Clusters present after filtering (adata_filtered no longer contains 'lowQC')
leiden_clusters = adata_filtered.obs['clusters_renamed'].unique()
# Convert to a list
leiden_clusters_list = leiden_clusters.tolist()

print(leiden_clusters)
ax = tmk.plot_single_flow(
    cluster_key="clusters_renamed",
    time_key="day_numerical",
    cluster="Multipotent Progenitors",
    min_flow=0.1,
    xticks_step_size=4,
    show=False,
    clusters=leiden_clusters_list,
    save="tmk_plot_single_flow.pdf",
)

_ = ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

In [None]:
sc.pl.umap(adata_filtered, color = "clusters_renamed")

In [None]:
g = cr.estimators.GPCCA(tmk)
print(g)

In [None]:
g.fit(cluster_key="clusters_renamed", n_states=[4, 12])
g.plot_macrostates(which="all", discrete=True, legend_loc="right", s=100)

In [None]:
g.predict_terminal_states()
g.plot_macrostates(which="terminal", legend_loc="right", s=100)

In [None]:
g.plot_macrostates(which="terminal", discrete=False)

In [None]:
g.predict_initial_states()
g.plot_macrostates(which="initial", legend_loc="right", s=100)

In [None]:
sc.pl.embedding(
    adata_filtered,
    basis="umap",
    color=["CD44", "CD22", "CCNB1", "TOP2A", "EGFR", "VIM"],
    size=50,
)

In [None]:
g.plot_coarse_T()
plt.show()

# Try PseudotimeKernel

In [None]:
pk = cr.kernels.PseudotimeKernel(adata_filtered, time_key="dpt_pseudotime")
pk.compute_transition_matrix()

print(pk)

In [None]:
pk.plot_projection(basis="umap", recompute=True, color="clusters_renamed")

In [None]:
g = cr.estimators.GPCCA(pk)
print(g)

In [None]:
g.fit(cluster_key="clusters_renamed", n_states=[4, 12])
g.plot_macrostates(which="all", discrete=True, legend_loc="right", s=100)

In [None]:
g.predict_terminal_states()
g.plot_macrostates(which="terminal", legend_loc="right", s=100)

In [None]:
g.plot_macrostates(which="terminal", discrete=False)

In [None]:
g.predict_initial_states(allow_overlap=True)
g.plot_macrostates(which="initial", legend_loc="right", s=100)

Since Multipotent Progenitors should be treated as the initial state, I set the other macrostates as terminal, leaving 'Multipotent Progenitors' out of the list.

In [None]:
g.set_terminal_states(states=['G1/S Replicative_1', 'Committed NPC_1', 'Commited NPC_2', 'G2/M Proliferative', 'G1/S Replicative_2'])

In [None]:
g.plot_coarse_T()
plt.show()

In [None]:
g.plot_macrostate_composition(key="clusters_renamed", figsize=(7, 4))
plt.show()

In [None]:
g.compute_fate_probabilities()
g.plot_fate_probabilities(same_plot=False)
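
Fate probabilities are absorption probabilities of the Markov chain: for each transient cell, the probability of ending up in each terminal state. On a small chain they can be computed directly by solving a linear system. The sketch below uses a toy matrix and plain NumPy; CellRank works on the full cell-cell transition matrix, so treat this only as an illustration of the quantity.

```python
import numpy as np

# Toy chain: states 0 and 1 are transient, states 2 and 3 are terminal (absorbing).
T = np.array(
    [
        [0.5, 0.2, 0.3, 0.0],
        [0.1, 0.5, 0.0, 0.4],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
)
transient = [0, 1]
terminal = [2, 3]

Q = T[np.ix_(transient, transient)]  # transient -> transient
R = T[np.ix_(transient, terminal)]   # transient -> terminal

# Absorption probabilities solve (I - Q) A = R.
A = np.linalg.solve(np.eye(len(transient)) - Q, R)
print(A)
```

Each row of `A` sums to 1, since every walk from a transient state is eventually absorbed in one of the terminal states.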

In [None]:
g.plot_fate_probabilities(same_plot=True)

In [None]:
cr.pl.circular_projection(adata_filtered, keys=["clusters_renamed"], legend_loc="right")
plt.show()

## Committed NPC_1 lineage drivers

In [None]:
driver_clusters = ['G1/S Replicative', 'Committed NPC_1', 'Commited NPC_2', 'G2/M Proliferative']

delta_df = g.compute_lineage_drivers(
    lineages=["Committed NPC_1"], cluster_key="clusters_renamed", clusters=driver_clusters
)
delta_df.head(10)
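
`compute_lineage_drivers` ranks genes by how well their expression correlates with the fate probabilities of a lineage. The core idea can be sketched as a per-gene Pearson correlation; the data below are synthetic and the helper name `pearson_with_fate` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200
fate_prob = rng.random(n_cells)  # toy fate probabilities for one lineage

# Toy expression matrix (cells x genes): gene 0 tracks the fate, gene 1 opposes it,
# gene 2 is unrelated noise.
X = np.column_stack(
    [
        fate_prob + 0.1 * rng.normal(size=n_cells),
        -fate_prob + 0.1 * rng.normal(size=n_cells),
        rng.normal(size=n_cells),
    ]
)


def pearson_with_fate(X, p):
    """Pearson correlation between each column of X and the vector p."""
    Xc = X - X.mean(axis=0)
    pc = p - p.mean()
    return (Xc * pc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((pc ** 2).sum())
    )


corr = pearson_with_fate(X, fate_prob)
print(np.round(corr, 2))
```

Genes with a strongly positive correlation would be candidate drivers of the lineage; strongly negative ones are depleted along it.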

In [None]:
adata_filtered.obs["fate_probabilities_Committed NPC_1"] = g.fate_probabilities["Committed NPC_1"].X.flatten()

sc.pl.embedding(
    adata_filtered,
    basis="umap",
    color=["fate_probabilities_Committed NPC_1"] + list(delta_df.index[:8]),
    color_map="viridis",
    s=50,
    ncols=3,
    vmax="p96",
)

In [None]:
driver_clusters = ['G1/S Replicative', 'Committed NPC_1', 'Commited NPC_2', 'G2/M Proliferative']

delta_df = g.compute_lineage_drivers(
    lineages=["Commited NPC_2"], cluster_key="clusters_renamed", clusters=driver_clusters
)
delta_df.head(10)

In [None]:
adata_filtered.obs["fate_probabilities_Commited NPC_2"] = g.fate_probabilities["Commited NPC_2"].X.flatten()

sc.pl.embedding(
    adata_filtered,
    basis="umap",
    color=["fate_probabilities_Commited NPC_2"] + list(delta_df.index[:8]),
    color_map="viridis",
    s=50,
    ncols=3,
    vmax="p96",
)

In [None]:
g.lineage_drivers

In [None]:
import pandas as pd
df = pd.DataFrame(g.lineage_drivers)
df.to_csv('lineage_drivers.csv', index=True)



In [None]:
# compute driver genes
delta_df = g.compute_lineage_drivers(
    lineages=["Committed NPC_1","Commited NPC_2"], cluster_key="clusters_renamed", clusters=driver_clusters
)

# define set of genes to annotate
Committed_NPC_2_genes = ["SCG2", "DLX2","DAAM1","IGFBP5","DLX1","MIAT","RND3"]
Committed_NPC_1_genes = ["REC8", "SLC38A1","RGS16","RIMS3","SERPINF1","ARGLU1","AUXG01000058.1","GATM"]

genes_oi = {
    "Committed NPC_1_genes": Committed_NPC_1_genes,
    "Commited NPC_2_genes": Committed_NPC_2_genes,
}

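# make sure all of these exist in AnnData
# (all(...) is required: a bare non-empty list is always truthy, so the check would never fail)
assert all(
    gene in adata_filtered.var_names for genes in genes_oi.values() for gene in genes
), "Did not find all genes"

# compute mean gene expression across all cells (works for sparse and dense .X)
import numpy as np
adata_filtered.var["mean expression"] = np.asarray(adata_filtered.X.mean(axis=0)).ravel()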

# visualize in a scatter plot
g.plot_lineage_drivers_correlation(
    lineage_x="Committed NPC_1",
    lineage_y="Commited NPC_2",
    adjust_text=True,
    gene_sets=genes_oi,
    color="mean expression",
    legend_loc="none",
    figsize=(5, 5),
    dpi=150,
    fontsize=9,
    size=50
)
plt.show()
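
One pitfall with membership checks like the one in this cell: `assert` on a list comprehension tests whether the list is non-empty, not whether its elements are true, so the check has to be wrapped in `all(...)` to fail on a missing gene. A short demonstration with placeholder gene names:

```python
var_names = {"GENE_A", "GENE_B"}    # placeholder for adata_filtered.var_names
genes = ["GENE_A", "GENE_MISSING"]  # one gene is absent on purpose

as_list = [gene in var_names for gene in genes]  # [True, False]

# A non-empty list is truthy even if it contains False values.
list_is_truthy = bool(as_list)
# all() evaluates every element, which is what the check needs.
every_gene_found = all(as_list)

print(list_is_truthy, every_gene_found)
```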

In [None]:
# I pickle the estimator (g) and the kernel (pk)
import pickle

# Save the object to a file
with open('g.pkl', 'wb') as f:
    pickle.dump(g, f)

with open('pk.pkl', 'wb') as f:
    pickle.dump(pk, f)



adata_filtered.write("time_course_cluster_9_out.h5ad")

In [None]:
# I'll start from here
import os
os.environ["CFLAGS"] = "-std=c99"
#!pip install rpy2

In [None]:
# pip install --upgrade rpy2


In [None]:
# Load Libraries
import pickle
from moscot.problems.time import TemporalProblem
import cellrank as cr
import scanpy as sc
from cellrank.kernels import RealTimeKernel
import rpy2

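# Load adata_filtered
adata_filtered = sc.read("time_course_cluster_9_out.h5ad")

# Load the pickled estimator (g) and kernel (pk)
with open('g.pkl', 'rb') as f:
    g = pickle.load(f)

with open('pk.pkl', 'rb') as f:
    pk = pickle.load(f)

print(g, pk)

# Gene sets from the previous session, needed again for the plots below
Committed_NPC_2_genes = ["SCG2", "DLX2","DAAM1","IGFBP5","DLX1","MIAT","RND3"]
Committed_NPC_1_genes = ["REC8", "SLC38A1","RGS16","RIMS3","SERPINF1","ARGLU1","AUXG01000058.1","GATM"]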


In [None]:
import matplotlib.pyplot as plt

In [None]:
#model = cr.models.GAMR(adata_filtered, n_knots=6, smoothing_penalty=10.0)
model = cr.models.GAM(adata_filtered, distribution='gaussian', link= 'identity')

In [None]:
model

In [None]:
# I need to install MAGIC

In [None]:
# incompatibility issue
#  !pip install pandas==1.5.3 # moscot needs > 2
# I try with the other
#!pip install --upgrade fcsparser

In [None]:
#!pip install --user magic-impute

In [None]:
#sc.external.pp.magic(adata_filtered,n_jobs=8)

In [None]:
cr.pl.gene_trends(
    adata_filtered,
    model=model,
    #data_key="magic_imputed_data",
    genes=Committed_NPC_2_genes,
    same_plot=True,
    ncols=3,
    time_key="dpt_pseudotime",
    hide_cells=True,
    weight_threshold=(1e-3, 1e-3),
)

In [None]:
# plot expression trends of the Committed NPC_1 genes along pseudotime
cr.pl.gene_trends(
    adata_filtered,
    model=model,
    #data_key="magic_imputed_data",
    genes=Committed_NPC_1_genes,
    same_plot=True,
    ncols=3,
    time_key="dpt_pseudotime",
    hide_cells=True,
    weight_threshold=(1e-3, 1e-3),
)

In [None]:
print(Committed_NPC_2_genes,Committed_NPC_1_genes)

In [None]:
sc.pl.umap(adata_filtered, color="sample")

In [None]:
sc.pl.violin(adata_filtered, keys=["dpt_pseudotime"], groupby="clusters_renamed", rotation=90)

In [None]:
# compute putative drivers for the Committed NPC_1 lineage
Committed_NPC_1_drivers = g.compute_lineage_drivers(lineages="Committed NPC_1")

# plot heatmap
cr.pl.heatmap(
    adata_filtered,
    model=model,  # use the model from before
    lineages="Committed NPC_1",
    cluster_key="clusters_renamed",
    show_fate_probabilities=True,
    #data_key="magic_imputed_data",
    genes=Committed_NPC_1_drivers.head(40).index,
    time_key="dpt_pseudotime",
    figsize=(12, 10),
    show_all_genes=True,
    weight_threshold=(1e-3, 1e-3),
)

In [None]:
# compute putative drivers for the 'Commited NPC_2' lineage (label spelled as in clusters_renamed)
Commited_NPC_2_drivers = g.compute_lineage_drivers(lineages="Commited NPC_2")

# plot heatmap
cr.pl.heatmap(
    adata_filtered,
    model=model,  # use the model from before
    lineages="Commited NPC_2",
    cluster_key="clusters_renamed",
    show_fate_probabilities=True,
    #data_key="magic_imputed_data",
    genes=Commited_NPC_2_drivers.head(40).index,
    time_key="dpt_pseudotime",
    figsize=(12, 10),
    show_all_genes=True,
    weight_threshold=(1e-3, 1e-3),
)

## DLX1 and DLX2 in NPC2

DLX1 and DLX2 are homeobox transcription factors that play crucial roles in the development and differentiation of neural cells. Their overexpression in NPC2 glioblastoma signatures, as opposed to NPC1, could be due to several reasons:

- Neural Differentiation Pathways: DLX1 and DLX2 are involved in the differentiation of neural progenitor cells into specific types of neurons, particularly GABAergic interneurons [1]. The overexpression of these genes in NPC2 might indicate a differentiation pathway that is more active or predominant in this subtype.
- Cellular Identity and Function: The distinct expression patterns of DLX1 and DLX2 could reflect differences in the cellular identity and function of NPC1 and NPC2 cells. NPC2 cells might be more committed to a specific lineage or function that requires higher levels of these transcription factors [2].
- Tumor Microenvironment: The tumor microenvironment can influence gene expression. NPC2 glioblastoma cells might be exposed to different signals or stresses that upregulate DLX1 and DLX2, contributing to their unique signature [3].
- Regulatory Networks: DLX1 and DLX2 are part of complex regulatory networks that control cell proliferation, migration, and differentiation. Differences in these networks between NPC1 and NPC2 could lead to the observed differences in gene expression [1].

Understanding these mechanisms can provide insights into the biology of glioblastoma and potentially identify targets for therapeutic intervention.

[1] Dlx1/2 are Central and Essential Components in the Transcriptional Code for Generating Olfactory Bulb Interneurons

[2] Expression of dlx genes in the normal and regenerating brain of adult zebrafish

[3] Dlx1 and Dlx2 Promote Interneuron GABA Synthesis, Synaptogenesis, and Survival