## PAGA Basic Analysis (Only Main Analysis)

- In this notebook, I include basic Leiden clustering with no downstream timeseries analysis

- Mean parameter values are projected onto the UMAP embedding for visualization

- Distributions of genes of known function should be assembled in this notebook

- Finally, basic clustering and ontology enrichment are present at the end

- This notebook should be used to decide on clustering parameters and the resulting paga_df saved to disk

### Note: Next Steps

- Need to rerun part of the sequencing pipeline (the ipynb) to get a proper universal sgRNA ID for the downstream steps. For now I am working around this with a hack, but it should be fixed soon.
- Co-embedding doesn't totally line up:
    - Possibly due to differences in the calculated timestep
    - Also might be helped by revisiting the normalization scheme

#### Tomorrow
- First, rerun the sequencing pipeline bit to get sgRNAids that are universal

- Plug it into the correct part of the pipeline to be included in the final dataframes
    - At this step, unify the downstream processing notebooks to analyze both datasets jointly
    - Also, try to keep individual trench traces unaveraged into the downstream steps so inter- and intra-dataset distances may be compared
    - (2022-02-11) Go to the third notebook and clean up the errors, then add the distance analysis to the 4a notebook with an overlay

- Rerun the analysis on the old dataset to use the same time calculation, since there may be a bias in that method that could compromise the co-embedding
    - Consider doing both permutations of the timepoint calculation
    
- Try co-embedding again and see if the result improves

- If co-embedding looks good, proceed to a distance analysis

- Once this is completed, design a final joint analysis for sgRNA phenotype categories for the second library

In [None]:
import ast
import copy
import random
import warnings

import anndata
import dask
import dask.array as da
import dask.dataframe as dd
import holoviews as hv
import igraph as ig
import leidenalg
import matplotlib as mpl
import matplotlib.gridspec as gridspec
import networkx as nx
import numpy as np
import pandas as pd
import pylab
import scanpy as sc
import scipy as sp
import scipy.cluster.hierarchy as sch
import scipy.sparse
import scipy.stats
import seaborn as sns
import sklearn as skl
import umap
from igraph.drawing.text import TextDrawer
from matplotlib import pyplot as plt
from scanpy.plotting.palettes import default_20, vega_20_scanpy
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering
from sklearn.linear_model import LinearRegression
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import (
    cosine_distances,
    euclidean_distances,
    manhattan_distances,
)
from tslearn.barycenters import (
    dtw_barycenter_averaging,
    euclidean_barycenter,
    softdtw_barycenter,
)
from tslearn.metrics import cdist_soft_dtw, cdist_soft_dtw_normalized
from tslearn.neighbors import KNeighborsTimeSeries

import paulssonlab.deaton.trenchripper.trenchripper as tr

hv.extension("bokeh")
random.seed(42)
np.random.seed(42)

warnings.filterwarnings(action="once", category=UserWarning)

### Initial Data Processing

Here, I am going to try to replicate (to some extent) the corrections from "Genomewide phenotypic analysis of growth, cell morphogenesis, and cell cycle events in Escherichia coli"

#### Start Dask

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="01:00:00",
    local=False,
    n_workers=100,
    n_workers_min=20,
    memory="16GB",
    working_directory="/home/de64/scratch/de64/dask",
)
dask_controller.startdask()

In [None]:
dask_controller.displaydashboard()

In [None]:
# dask_controller.shutdown()

In [None]:
gene_cluster_df_full_w_control = pd.read_pickle(
    "/home/de64/scratch/de64/sync_folder/2021-11-08_lDE20_Final_3/2021-12-07_gene_cluster_df_no_filter.pkl"
)
gene_cluster_df_full = gene_cluster_df_full_w_control.dropna(
    subset=["Gene"]
)  # no control genes

gene_cluster_df_full_w_control_2 = pd.read_pickle(
    "/home/de64/scratch/de64/sync_folder/2022-01-20_lDE20_Final_6/2022-02-10_gene_cluster_df_no_filter.pkl"
)
gene_cluster_df_full_2 = gene_cluster_df_full_w_control_2.dropna(
    subset=["Gene"]
)  # no control genes

In [None]:
# HACK: index both datasets on sgRNA until a universal sgRNA ID is available upstream
gene_cluster_df_full = gene_cluster_df_full.reset_index().set_index("sgRNA")
gene_cluster_df_full_2 = gene_cluster_df_full_2.reset_index().set_index("sgRNA")

In [None]:
gene_cluster_df_full["Experiment #"] = 1
gene_cluster_df_full_2["Experiment #"] = 2
gene_cluster_df_full_all = pd.concat([gene_cluster_df_full, gene_cluster_df_full_2])

In [None]:
gene_cluster_df_full_all = (
    gene_cluster_df_full_all.reset_index(drop=False)
    .set_index(["sgRNA", "Experiment #"])
    .sort_index()
)

In [None]:
gene_cluster_df_full_all[
    "Experiment #"
] = gene_cluster_df_full_all.index.get_level_values(1)

## 1) Global Analysis - KNN, Leiden, UMAP and PAGA

In [None]:
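def parallel_norm_soft_dtw(X, chunk_size=200):
    """Pairwise normalized soft-DTW between all series in X (n, time, features)."""
    # Chunk only along the sample axis; each block holds complete time series.
    X_dask = da.from_array(X, chunks=(chunk_size, X.shape[1], X.shape[2]))
    # Evaluate cdist_soft_dtw on every pair of sample blocks in parallel.
    soft_dtw_arr = da.blockwise(
        cdist_soft_dtw, "ik", X_dask, "itd", X_dask, "ktd", concatenate=True
    ).compute()
    # Subtract the mean self-distance so that every diagonal entry becomes zero.
    d_ii = np.diag(soft_dtw_arr)
    norm_soft_dtw_arr = soft_dtw_arr - (
        0.5 * (d_ii.reshape((-1, 1)) + d_ii.reshape((1, -1)))
    )
    return norm_soft_dtw_arr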
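
As a quick sanity check on the normalization in `parallel_norm_soft_dtw`, the sketch below applies the same diagonal correction to a small hand-made matrix standing in for the raw soft-DTW output. The helper name `normalize_by_self_distance` and the toy values are illustrative assumptions, not part of the pipeline.

```python
import numpy as np


def normalize_by_self_distance(raw):
    """Subtract the mean of the two self-distances from every entry."""
    d_ii = np.diag(raw)
    return raw - 0.5 * (d_ii[:, None] + d_ii[None, :])


# Toy symmetric matrix with non-zero "self-distances" on the diagonal,
# standing in for raw cdist_soft_dtw output (values are made up).
raw = np.array(
    [
        [-2.0, 3.0, 5.0],
        [3.0, -4.0, 1.0],
        [5.0, 1.0, -6.0],
    ]
)
norm = normalize_by_self_distance(raw)
print(norm)
```

After the correction the diagonal is zero and the matrix stays symmetric, so it behaves more like a distance matrix than the raw soft-DTW values do.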

### Relabel timeseries (correct this upstream later)

In [None]:
gene_cluster_df_full_all = gene_cluster_df_full_all.rename(
    columns={
        "Kernel Trace: Division: major_axis_length: Yeo-Johnson: z score": "Division Length Z-score",
        "Kernel Trace: Mean Linear Growth Rate: Volume: Yeo-Johnson: z score": "Linear Growth Rate Z-score",
        "Kernel Trace: Mean Exponential Growth Rate: Volume: Yeo-Johnson: z score": "Exponential Growth Rate Z-score",
        "Kernel Trace: Mean: minor_axis_length: Yeo-Johnson: z score": "Width Z-score",
        "Kernel Trace: Mean: mCherry Intensity: Yeo-Johnson: z score": "mCherry Intensity Z-score",
        "Kernel Trace: Delta time (s): Yeo-Johnson: z score": "Doubling Time Z-score",
        "Kernel Trace: Division: major_axis_length": "Division Length",
        "Kernel Trace: Mean Linear Growth Rate: Volume": "Linear Growth Rate",
        "Kernel Trace: Mean Exponential Growth Rate: Volume": "Exponential Growth Rate",
        "Kernel Trace: Mean: minor_axis_length": "Width",
        "Kernel Trace: Mean: mCherry Intensity": "mCherry Intensity",
        "Kernel Trace: Delta time (s)": "Doubling Time",
    }
)

### Take mean z-scores over the timeseries

In [None]:
traces = [
    "Linear Growth Rate",
    "Exponential Growth Rate",
    "Division Length",
    "Width",
    "mCherry Intensity",
    "Doubling Time",
]

zscore_traces = [trace + " Z-score" for trace in traces]

for trace in traces:
    avg = gene_cluster_df_full_all.apply(lambda x: np.mean(x[trace]), axis=1)
    gene_cluster_df_full_all[trace + ": Mean"] = avg

for zscore_trace in zscore_traces:
    avg_zscore = gene_cluster_df_full_all.apply(
        lambda x: np.mean(x[zscore_trace]), axis=1
    )
    gene_cluster_df_full_all[zscore_trace + ": Mean"] = avg_zscore

### Filter for strong effects by taking max over integrated zscores

In [None]:
min_feature_thr = 30

gene_cluster_df_filtered = gene_cluster_df_full_all[
    gene_cluster_df_full_all["Integrated Feature Max"] > min_feature_thr
]

In [None]:
# Histogram both sides of the threshold on the combined dataset,
# i.e. the same frame that the filter above is applied to.
plt.hist(
    gene_cluster_df_full_all[
        gene_cluster_df_full_all["Integrated Feature Max"] > min_feature_thr
    ]["Integrated Feature Max"],
    bins=50,
    range=(0, 50),
)
plt.hist(
    gene_cluster_df_full_all[
        gene_cluster_df_full_all["Integrated Feature Max"] < min_feature_thr
    ]["Integrated Feature Max"],
    bins=50,
    range=(0, 50),
)
plt.show()

### soft-DTW Calculation

In [None]:
X = np.array(gene_cluster_df_filtered["Feature Vector"].tolist())
X = np.swapaxes(X, 1, 2)
norm_soft_dtw_arr = parallel_norm_soft_dtw(X)

### Initialize Anndata Object

In [None]:
an_df = anndata.AnnData(
    X=X.reshape(X.shape[0], -1), obs=gene_cluster_df_filtered
)  # AnnData container to use scanpy functions with unwrapped time vector

### Compute KNN Graph

TODO: tune the hyperparameter search so that sgRNAs targeting the same gene co-cluster

In [None]:
n_neighbors = 15
n_pcs = 20  # placeholder: the graph built by sc.pp.neighbors is overwritten below

sc.pp.neighbors(an_df, n_neighbors=n_neighbors, n_pcs=n_pcs)
knn_indices, knn_dists, forest = sc.neighbors.compute_neighbors_umap(
    norm_soft_dtw_arr, n_neighbors=n_neighbors, metric="precomputed"
)
(
    an_df.uns["neighbors"]["distances"],
    an_df.uns["neighbors"]["connectivities"],
) = sc.neighbors._compute_connectivities_umap(
    knn_indices,
    knn_dists,
    an_df.shape[0],
    n_neighbors,  # change to neighbors you plan to use
)
an_df.obsp["distances"] = an_df.uns["neighbors"]["distances"]
an_df.obsp["connectivities"] = an_df.uns["neighbors"]["connectivities"]
an_df.obsp["soft_dtw"] = norm_soft_dtw_arr
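
To make the `metric="precomputed"` step concrete, here is a minimal NumPy sketch of what a k-nearest-neighbor lookup on a precomputed distance matrix amounts to. It is a brute-force stand-in, not the routine scanpy or UMAP actually runs; `knn_from_distances` and the toy matrix are hypothetical.

```python
import numpy as np


def knn_from_distances(dist, n_neighbors):
    """Return (indices, distances) of the n_neighbors closest points per row.

    In this sketch each point is its own nearest neighbor (distance 0),
    so column 0 of the result is the point itself.
    """
    order = np.argsort(dist, axis=1, kind="stable")[:, :n_neighbors]
    return order, np.take_along_axis(dist, order, axis=1)


# Made-up symmetric distance matrix for four points.
toy_dist = np.array(
    [
        [0.0, 1.0, 4.0, 9.0],
        [1.0, 0.0, 2.0, 7.0],
        [4.0, 2.0, 0.0, 3.0],
        [9.0, 7.0, 3.0, 0.0],
    ]
)
toy_indices, toy_dists = knn_from_distances(toy_dist, n_neighbors=2)
print(toy_indices)
print(toy_dists)
```

The two returned arrays have shape `(n_points, n_neighbors)`, which is the layout the connectivity computation above consumes.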

### Computing Leiden, PAGA and UMAP

Note that UMAP is computed only once, on the resolution-1.0 object; the Leiden labels from the other resolutions are copied onto it, so all panels share the same UMAP positions

In [None]:
min_dist = 0.1
spread = 5.0
# spread = 1.

paga_df_dict = {}
for resolution in [0.25, 1.0, 1.5, 3.0]:
    paga_df_dict[resolution] = copy.deepcopy(an_df)
    sc.tl.leiden(paga_df_dict[resolution], resolution=resolution, n_iterations=-1)
    sc.tl.paga(paga_df_dict[resolution], groups="leiden")
    sc.pl.paga(paga_df_dict[resolution], add_pos=True, show=False)
sc.tl.umap(paga_df_dict[1.0], init_pos="paga", min_dist=min_dist, spread=spread)
paga_df_dict[1.0].obs["leiden_lowres"] = paga_df_dict[0.25].obs["leiden"]
paga_df_dict[1.0].obs["leiden_highres"] = paga_df_dict[1.5].obs["leiden"]
paga_df_dict[1.0].obs["leiden_ultrahighres"] = paga_df_dict[3.0].obs["leiden"]
paga_df = paga_df_dict[1.0]

In [None]:
fig = sc.pl.umap(
    paga_df,
    color=["leiden_lowres", "leiden", "leiden_highres", "leiden_ultrahighres"],
    title=[
        "Leiden Resolution=0.25",
        "Leiden Resolution=1.",
        "Leiden Resolution=1.5",
        "Leiden Resolution=3.",
    ],
    show=False,
    legend_loc="on data",
    edges=True,
    add_outline=False,
    size=100,
    return_fig=True,
    palette=vega_20_scanpy,
    ncols=2,
)
axes = fig.get_axes()
for ax in axes:
    ax.set_title(ax.get_title(), fontsize=18)
    ax.set_ylabel(ax.get_ylabel(), fontsize=18)
    ax.set_xlabel(ax.get_xlabel(), fontsize=18)
# fig.savefig("./Global_PAGA.png",dpi=500)

In [None]:
paga_df.obs["Experiment #"] = paga_df.obs["Experiment #"].astype("category")

In [None]:
fig = sc.pl.umap(
    paga_df,
    color=["Experiment #"],
    title=["Experiment #"],
    show=False,
    edges=False,
    add_outline=False,
    size=10,
    return_fig=True,
    ncols=2,
)

In [None]:
paga_df.obs

### Plotting Mean Z-scores, Euclidean Norm, and N Match

In [None]:
paga_df.obs["N Match"] = 20.0 - paga_df.obs["N Mismatch"]
del_N_match_series = paga_df.obs.groupby("TargetID").apply(
    lambda x: x["N Match"] - np.min(x["N Match"])
)
del_N_match_series = del_N_match_series.droplevel("TargetID")
paga_df.obs["Delta N Match"] = del_N_match_series
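
The `groupby(...).apply(...)` plus `droplevel` pattern above depends on the result coming back with a `TargetID` index level. A possible alternative is `groupby(...).transform`, which returns a Series aligned to the original index. The toy frame below is made up for illustration; only the column names are taken from the notebook.

```python
import pandas as pd

toy_obs = pd.DataFrame(
    {
        "TargetID": [7, 7, 7, 9, 9],
        "N Match": [20.0, 18.0, 15.0, 19.0, 17.0],
    },
    index=["sg_a", "sg_b", "sg_c", "sg_d", "sg_e"],
)

# Subtract each target's minimum match count; the result keeps the original index.
group_min = toy_obs.groupby("TargetID")["N Match"].transform("min")
toy_obs["Delta N Match"] = toy_obs["N Match"] - group_min
print(toy_obs)
```

Because `transform` preserves the row index, no index level has to be dropped before assigning the column back.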

In [None]:
labels = [zscore_trace + ": Mean" for zscore_trace in zscore_traces]

fig = sc.pl.umap(
    paga_df,
    color=labels,
    show=False,
    legend_loc="on data",
    add_outline=False,
    size=50,
    return_fig=True,
    vcenter=0.0,
    cmap="RdBu_r",
    wspace=0.25,
    ncols=3,
)

axes = fig.get_axes()
for ax in axes:
    ax.set_title(ax.get_title(), fontsize=18)
    ax.set_ylabel(ax.get_ylabel(), fontsize=18)
    ax.set_xlabel(ax.get_xlabel(), fontsize=18)

# fig.savefig("./Mean_zscores.png",dpi=300)

In [None]:
labels = [trace + ": Mean" for trace in traces]

fig = sc.pl.umap(
    paga_df,
    color=labels,
    show=False,
    legend_loc="on data",
    add_outline=False,
    size=50,
    return_fig=True,
    cmap="RdBu_r",
)
# fig.savefig("./1_Global_Analysis/Mean.png",dpi=300)

In [None]:
labels = ["Delta N Match", "Integrated Euclidean Norm"]

fig = sc.pl.umap(
    paga_df,
    color=labels,
    show=False,
    legend_loc="on data",
    add_outline=False,
    size=50,
    return_fig=True,
    vcenter=0.0,
    cmap="RdBu_r",
)
# fig.savefig("./1_Global_Analysis/Match_and_Euc_Norm.png",dpi=300)

### Highlight Genes of Interest

In [None]:
import goatools
import goatools.base
from goatools.anno.gaf_reader import GafReader
from goatools.base import download_go_basic_obo
from goatools.go_enrichment import GOEnrichmentStudy
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS
from goatools.obo_parser import GODag
from goatools.semantic import TermCounts, get_info_content, semantic_similarity


def search_go(ns2assoc, obodag, inv_gene_to_id, go_term):
    namespace_abbv = {
        "biological_process": "BP",
        "molecular_function": "MF",
        "cellular_component": "CC",
    }

    print("Searching for " + str(obodag[go_term].name))
    namespace = namespace_abbv[obodag[go_term].namespace]
    child_goterms = list(obodag[go_term].get_all_children())
    gene_list = [
        inv_gene_to_id[key]
        for key, val in ns2assoc[namespace].items()
        if go_term in val
    ]
    for child_goterm in child_goterms:
        gene_list += [
            inv_gene_to_id[key]
            for key, val in ns2assoc[namespace].items()
            if child_goterm in val
        ]
    gene_list = sorted(list(set(gene_list)))
    return gene_list


def selection_fn(item, gene_name):
    is_gene = item["Gene"] == gene_name
    if is_gene:
        return item["TargetID"]
    else:
        return 0


def highlight_gene_group(an_df, selection_list):
    highlight_genes_df = copy.deepcopy(an_df)

    selection_list = sorted(
        list(
            set(highlight_genes_df.obs["Gene"].unique().tolist()) & set(selection_list)
        )
    )

    for i, selected_gene in enumerate(selection_list):
        selected_series = (highlight_genes_df.obs["Gene"] == selected_gene).astype(
            "category"
        )
        selected_series = selected_series.cat.reorder_categories([True, False])
        highlight_genes_df.obs["Selected Genes: " + str(i)] = selected_series

    selected_series = (highlight_genes_df.obs["Gene"].isin(selection_list)).astype(
        "category"
    )
    selected_series = selected_series.cat.reorder_categories([True, False])
    highlight_genes_df.obs["All Genes"] = selected_series

    # selected_series = (paga_df.obs["Gene"]=="ftsZ").astype(float)
    # selected_series[selected_series==0.] = np.NaN
    # paga_df.obs["Selected Genes"] = selected_series

    fig = sc.pl.umap(
        highlight_genes_df,
        title=selection_list + ["All Genes"],
        color=["Selected Genes: " + str(i) for i in range(len(selection_list))]
        + ["All Genes"],
        groups=[True],
        show=False,
        legend_loc="right margin",
        add_outline=False,
        size=50,
        return_fig=True,
        palette={True: "red", False: "lightgrey"},
    )  # palette ={}

    return fig


def highlight_sgrnas(an_df, selection_list):
    highlight_genes_df = copy.deepcopy(an_df)
    highlight_genes_df.obs["tempindex"] = highlight_genes_df.obs.index

    selection_list = sorted(
        list(
            set(highlight_genes_df.obs["tempindex"].unique().tolist())
            & set(selection_list)
        )
    )
    # Fix the category order explicitly so this also works when the selection
    # is empty (or covers every sgRNA) and only one boolean value occurs.
    selected_series = pd.Categorical(
        highlight_genes_df.obs["tempindex"].isin(selection_list),
        categories=[True, False],
    )
    highlight_genes_df.obs["All sgRNAs"] = selected_series

    # selected_series = (paga_df.obs["Gene"]=="ftsZ").astype(float)
    # selected_series[selected_series==0.] = np.NaN
    # paga_df.obs["Selected Genes"] = selected_series

    fig = sc.pl.umap(
        highlight_genes_df,
        title="All sgRNAs",
        color="All sgRNAs",
        groups=[True],
        show=False,
        legend_loc="right margin",
        add_outline=False,
        size=50,
        return_fig=True,
        palette={True: "red", False: "lightgrey"},
    )  # palette ={}

    return fig
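
The descendant lookup in `search_go` can be illustrated without goatools. The sketch below uses a made-up three-term hierarchy and plain dictionaries; `genes_for_term`, the term names, and the gene names are all hypothetical and only mirror the idea of "genes annotated with a term or any of its children".

```python
def genes_for_term(term, children, gene_to_terms):
    """Collect genes annotated with `term` or any of its descendants."""
    # Walk the toy hierarchy to gather the term plus all descendants.
    wanted, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in wanted:
            wanted.add(current)
            stack.extend(children.get(current, ()))
    # Keep every gene with at least one annotation in the wanted set.
    return sorted(g for g, terms in gene_to_terms.items() if terms & wanted)


# Hypothetical hierarchy and annotations (not real GO data).
toy_children = {"T:division": ["T:septum"], "T:septum": ["T:z_ring"]}
toy_gene_to_terms = {
    "geneA": {"T:z_ring"},
    "geneB": {"T:septum"},
    "geneC": {"T:transport"},
    "geneD": {"T:division", "T:transport"},
}
toy_hits = genes_for_term("T:division", toy_children, toy_gene_to_terms)
print(toy_hits)
```

This version makes a single pass over the annotations after collecting the descendant set, whereas `search_go` scans the associations once per child term; for the notebook's data sizes either approach should be fine.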

In [None]:
# Get ontologies
obo_fname = download_go_basic_obo()

# Get ecoli association file (ecocyc)
gaf_handle = goatools.base.http_get(
    "http://current.geneontology.org/annotations/ecocyc.gaf.gz", fout="./ecocyc.gaf.gz"
)
gaf_fname = goatools.base.gunzip("./ecocyc.gaf.gz")

# Load the ontology DAG and the EcoCyc annotations

obodag = GODag(obo_fname)
objanno = GafReader(gaf_fname)
ns2assoc = objanno.get_ns2assc()

gene_to_id = {assoc.DB_Symbol: assoc.DB_ID for assoc in objanno.associations}
inv_gene_to_id = {assoc.DB_ID: assoc.DB_Symbol for assoc in objanno.associations}
synonym_dict = {
    synonym: assoc.DB_ID
    for assoc in objanno.associations
    for synonym in assoc.DB_Synonym
}
gene_to_id.update(synonym_dict)

In [None]:
fig = highlight_sgrnas(
    paga_df, paga_df.obs[paga_df.obs["Gene"].apply(lambda x: "fts" in x)].index.tolist()
)
axes = fig.get_axes()
for ax in axes:
    ax.set_title("fts Genes", fontsize=18)
    ax.set_ylabel(ax.get_ylabel(), fontsize=18)
    ax.set_xlabel(ax.get_xlabel(), fontsize=18)
fig.savefig("./fts_genes.png", dpi=500)

In [None]:
fig = highlight_sgrnas(
    paga_df,
    paga_df.obs[
        paga_df.obs["Gene"].apply(lambda x: ("rps" in x) | ("rpm" in x) | ("rpl" in x))
    ].index.tolist(),
)
axes = fig.get_axes()
for ax in axes:
    ax.set_title("Ribosomal Protein Genes", fontsize=18)
    ax.set_ylabel(ax.get_ylabel(), fontsize=18)
    ax.set_xlabel(ax.get_xlabel(), fontsize=18)
fig.savefig("./ribosomal_protein_genes.png", dpi=500)

In [None]:
tRNA_aminoacylation_genes = search_go(ns2assoc, obodag, inv_gene_to_id, "GO:0043039")

fig = highlight_sgrnas(
    paga_df,
    paga_df.obs[
        paga_df.obs["Gene"].apply(lambda x: x in tRNA_aminoacylation_genes)
    ].index.tolist(),
)
axes = fig.get_axes()
for ax in axes:
    ax.set_title("tRNA Aminoacylation Genes", fontsize=18)
    ax.set_ylabel(ax.get_ylabel(), fontsize=18)
    ax.set_xlabel(ax.get_xlabel(), fontsize=18)
fig.savefig("./tRNA_aminoacetylation_genes.png", dpi=500)

In [None]:
fig = highlight_sgrnas(
    paga_df,
    paga_df.obs[
        paga_df.obs["Gene"].apply(lambda x: ("tufA" in x) | ("tufB" in x))
    ].index.tolist(),
)
axes = fig.get_axes()
for ax in axes:
    ax.set_title("tufAB Genes", fontsize=18)
    ax.set_ylabel(ax.get_ylabel(), fontsize=18)
    ax.set_xlabel(ax.get_xlabel(), fontsize=18)
fig.savefig("./eftu.png", dpi=500)

In [None]:
tRNA_list = [
    "alaW",
    "alaX",
    "alaV",
    "alaU",
    "alaT",
    "argY",
    "argZ",
    "argQ",
    "argV",
    "argX",
    "argW",
    "argU",
    "asnU",
    "asnW",
    "asnT",
    "asnV",
    "aspV",
    "aspT",
    "aspU",
    "cysT",
    "glnX",
    "glnV",
    "glnW",
    "glnU",
    "gltT",
    "gltU",
    "gltV",
    "gltW",
    "glyU",
    "glyW",
    "glyX",
    "glyY",
    "glyV",
    "glyT",
    "hisR",
    "ileY",
    "ileX",
    "ileV",
    "ileT",
    "ileU",
    "metZ",
    "metV",
    "metW",
    "metY",
    "leuX",
    "leuV",
    "leuT",
    "leuP",
    "leuQ",
    "leuU",
    "leuZ",
    "leuW",
    "lysT",
    "lysY",
    "lysV",
    "lysZ",
    "lysW",
    "lysQ",
    "metU",
    "metT",
    "pheU",
    "pheV",
    "proK",
    "proL",
    "proM",
    "selC",
    "serU",
    "serV",
    "serW",
    "serX",
    "serT",
    "thrW",
    "thrV",
    "thrT",
    "thrU",
    "trpT",
    "tyrV",
    "tyrT",
    "tyrU",
    "valV",
    "valW",
    "valT",
    "valY",
    "valU",
    "valX",
    "valZ",
]

fig = highlight_sgrnas(
    paga_df,
    paga_df.obs[paga_df.obs["Gene"].apply(lambda x: x in tRNA_list)].index.tolist(),
)
axes = fig.get_axes()
for ax in axes:
    ax.set_title("tRNA Genes", fontsize=18)
    ax.set_ylabel(ax.get_ylabel(), fontsize=18)
    ax.set_xlabel(ax.get_xlabel(), fontsize=18)
fig.savefig("./tRNA_genes.png", dpi=500)

In [None]:
# rplJ and rplL: ribosomal stalk proteins
ribostalk_genes = (
    paga_df.obs["Gene"][
        paga_df.obs["Gene"].apply(lambda x: ("rplJ" in x) | ("rplL" in x))
    ]
    .unique()
    .tolist()
)
fig = highlight_gene_group(paga_df, ribostalk_genes)
axes = fig.get_axes()
for ax in axes:
    ax.set_title(ax.get_title(), fontsize=18)
    ax.set_ylabel(ax.get_ylabel(), fontsize=18)
    ax.set_xlabel(ax.get_xlabel(), fontsize=18)
fig.savefig("./ribostalk.png", dpi=500)

In [None]:
rne_genes = (
    paga_df.obs["Gene"][
        paga_df.obs["Gene"].isin(["dnaA", "dnaB", "dnaE", "rne", "rnhB"])
    ]
    .unique()
    .tolist()
)
fig = highlight_gene_group(paga_df, rne_genes)
# fig.savefig("./1_Global_Analysis/Highlight_Genes/rne_genes.png",dpi=150)

In [None]:
fts_genes = (
    paga_df.obs["Gene"][paga_df.obs["Gene"].apply(lambda x: "fts" in x)]
    .unique()
    .tolist()
)
fig = highlight_gene_group(paga_df, fts_genes)
# fig.savefig("./1_Global_Analysis/Highlight_Genes/fts_genes.png",dpi=150)

In [None]:
mre_genes = (
    paga_df.obs["Gene"][paga_df.obs["Gene"].apply(lambda x: "mre" in x)]
    .unique()
    .tolist()
)
fig = highlight_gene_group(paga_df, mre_genes)
# fig.savefig("./1_Global_Analysis/Highlight_Genes/mre_genes.png",dpi=150)

In [None]:
sec_and_bam_genes = (
    paga_df.obs["Gene"][
        paga_df.obs["Gene"].apply(
            lambda x: ("sec" in x) or ("bam" in x) or ("yidC" in x) or ("yajC" in x)
        )
    ]
    .unique()
    .tolist()
)
fig = highlight_gene_group(paga_df, sec_and_bam_genes)
# fig.savefig("./1_Global_Analysis/Highlight_Genes/sec_and_bam_genes.png",dpi=150)

In [None]:
hol_genes = (
    paga_df.obs["Gene"][paga_df.obs["Gene"].apply(lambda x: ("hol" in x))]
    .unique()
    .tolist()
)
fig = highlight_gene_group(paga_df, hol_genes)
fig.savefig("./1_Global_Analysis/Highlight_Genes/hol_genes.png", dpi=150)

In [None]:
rpo_genes = (
    paga_df.obs["Gene"][paga_df.obs["Gene"].apply(lambda x: ("rpo" in x))]
    .unique()
    .tolist()
)
fig = highlight_gene_group(paga_df, rpo_genes)
fig.savefig("./1_Global_Analysis/Highlight_Genes/rpo_genes.png", dpi=150)

In [None]:
all_genes = paga_df.obs["Gene"].unique().tolist()
step = 75
for idx, i in enumerate(list(range(0, len(all_genes), step))):
    all_genes_sub = all_genes[i : i + step]
    fig = highlight_gene_group(paga_df, all_genes_sub)
    fig.savefig(
        "./1_Global_Analysis/Highlight_Genes/All_Genes/all_genes_" + str(idx) + ".png",
        dpi=75,
    )

### Highlight Genes of Interest (by GO)

In [None]:
division_genes = search_go(ns2assoc, obodag, inv_gene_to_id, "GO:0051301")
fig = highlight_gene_group(paga_df, division_genes)
fig.savefig("./1_Global_Analysis/Highlight_Genes/division_genes.png", dpi=150)

In [None]:
ribosome_genes = search_go(ns2assoc, obodag, inv_gene_to_id, "GO:0005840")
fig = highlight_gene_group(paga_df, ribosome_genes)
fig.savefig("./1_Global_Analysis/Highlight_Genes/ribosome_genes.png", dpi=150)

In [None]:
peptidoglycan_genes = search_go(ns2assoc, obodag, inv_gene_to_id, "GO:0000270")
fig = highlight_gene_group(paga_df, peptidoglycan_genes)
fig.savefig("./1_Global_Analysis/Highlight_Genes/peptidoglycan_genes.png", dpi=150)

In [None]:
replication_genes = search_go(ns2assoc, obodag, inv_gene_to_id, "GO:0006260")
fig = highlight_gene_group(paga_df, replication_genes)
fig.savefig("./1_Global_Analysis/Highlight_Genes/replication_genes.png", dpi=150)

In [None]:
initiation_genes = search_go(ns2assoc, obodag, inv_gene_to_id, "GO:0006270")
fig = highlight_gene_group(paga_df, initiation_genes)
fig.savefig("./1_Global_Analysis/Highlight_Genes/initiation_genes.png", dpi=150)

In [None]:
shape_genes = search_go(ns2assoc, obodag, inv_gene_to_id, "GO:0008360")
fig = highlight_gene_group(paga_df, shape_genes)
fig.savefig("./1_Global_Analysis/Highlight_Genes/shape_genes.png", dpi=150)

## 2) Cluster Analysis

### GO Term Enrichment

In [None]:
def get_enriched_GO_terms(
    background_gene_list, gene_list, obodag, objanno, ns2assoc, pval=0.05, GO_type="BP"
):
    gene_to_id = {assoc.DB_Symbol: assoc.DB_ID for assoc in objanno.associations}
    synonym_dict = {
        synonym: assoc.DB_ID
        for assoc in objanno.associations
        for synonym in assoc.DB_Synonym
    }
    gene_to_id.update(synonym_dict)

    # background gene set

    all_genes_uniprot = [
        gene_to_id[item] for item in background_gene_list if item in gene_to_id.keys()
    ]
    selected_genes_uniprot = [
        gene_to_id[item] for item in gene_list if item in gene_to_id.keys()
    ]

    print(len(all_genes_uniprot))
    print(len(selected_genes_uniprot))

    goeaobj = GOEnrichmentStudy(
        all_genes_uniprot,  # background (population) gene IDs
        ns2assoc[GO_type],  # geneid/GO associations
        obodag,  # Ontologies
        propagate_counts=True,
        alpha=pval,  # significance cut-off
        methods=["fdr_bh"],  # multiple-testing correction method
    )

    goea_results_all = goeaobj.run_study(selected_genes_uniprot, prt=None)
    goea_quiet_sig = [r for r in goea_results_all if r.p_fdr_bh < pval]
    goea_quiet_enriched = [r for r in goea_quiet_sig if r.enrichment == "e"]
    return goea_quiet_enriched


def pick_exemplar(go1, go2, termcounts, obodag, info_thr, pval_factor=2.0):
    info_1_low = get_info_content(go1.GO, termcounts) < info_thr
    info_2_low = get_info_content(go2.GO, termcounts) < info_thr
    if info_1_low and not info_2_low:
        return go2
    elif info_2_low and not info_1_low:
        return go1
    elif info_2_low and info_1_low:
        return go1

    pval_ratio = go1.p_fdr_bh / go2.p_fdr_bh

    if pval_ratio > pval_factor:
        return go2
    elif pval_ratio < (1.0 / pval_factor):
        return go1

    go1_parents = list(obodag[go1.GO].get_all_parents())
    go2_parents = list(obodag[go2.GO].get_all_parents())

    if go2.GO in go1_parents:
        return go2

    elif go1.GO in go2_parents:
        return go1

    return go1


def get_filtered_go_terms(
    obodag, objanno, goea_list, sim_thr=0.05, info_thr=1.0, GO_type="BP"
):
    termcounts = TermCounts(obodag, objanno.get_ns2assc()[GO_type])

    go_term_list = [item.GO for item in goea_list]
    sim_arr = np.zeros((len(go_term_list), len(go_term_list)))
    for i in range(len(go_term_list)):
        for j in range(len(go_term_list)):
            sim_arr[i, j] = semantic_similarity(
                go_term_list[i], go_term_list[j], obodag
            )
    np.fill_diagonal(sim_arr, 0.0)

    working_group_idx = 0
    grouped_terms = {}
    group_exemplars = {}
    go_term_indices = list(range(len(go_term_list)))

    while len(go_term_indices) > 0:
        i = go_term_indices[0]
        most_sim_arg = np.argmax(sim_arr[i])
        sim_score = sim_arr[i, most_sim_arg]
        if sim_score > sim_thr:
            if len(grouped_terms) > 0:
                in_other_group_keys = [
                    key for key, val in grouped_terms.items() if most_sim_arg in val
                ]
                if len(in_other_group_keys) == 1:
                    other_group_idx = in_other_group_keys[0]
                    grouped_terms[other_group_idx] = grouped_terms[other_group_idx] + [
                        i
                    ]
                    group_exemplars[other_group_idx] = pick_exemplar(
                        group_exemplars[other_group_idx],
                        goea_list[i],
                        termcounts,
                        obodag,
                        info_thr,
                    )
                else:
                    grouped_terms[working_group_idx] = [i, most_sim_arg]
                    group_exemplars[working_group_idx] = pick_exemplar(
                        goea_list[i],
                        goea_list[most_sim_arg],
                        termcounts,
                        obodag,
                        info_thr,
                    )
                    working_group_idx += 1
                    go_term_indices.remove(most_sim_arg)
            else:
                grouped_terms[working_group_idx] = [i, most_sim_arg]
                group_exemplars[working_group_idx] = pick_exemplar(
                    goea_list[i], goea_list[most_sim_arg], termcounts, obodag, info_thr
                )
                working_group_idx += 1
                go_term_indices.remove(most_sim_arg)
        go_term_indices.remove(i)

    group_exemplars = list(group_exemplars.values())

    return group_exemplars


def get_GO_assign_dict(selected_goea, cluster_genes_uniprot):
    all_study_items = copy.copy(cluster_genes_uniprot)
    depth_list = sorted(set([item.depth for item in selected_goea]))[::-1]
    assign_dict = {}
    for depth in depth_list:
        go_terms_at_level = [item for item in selected_goea if item.depth == depth]
        for go_term in go_terms_at_level:
            study_item_list = list(go_term.study_items)
            for study_item in study_item_list:
                if study_item in all_study_items:
                    assign_dict[study_item] = go_term.name
                    all_study_items.remove(study_item)

    for remaining_item in all_study_items:
        assign_dict[remaining_item] = "Unassigned"

    return assign_dict
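
The deepest-first assignment in `get_GO_assign_dict` can be shown on a toy example. The `Term` namedtuple below is a hypothetical stand-in for a goatools enrichment record (only `name`, `depth`, and `study_items` are mimicked), and `assign_deepest_term` is an illustrative rewrite rather than the notebook's function.

```python
from collections import namedtuple

# Hypothetical stand-in for a goatools enrichment record.
Term = namedtuple("Term", ["name", "depth", "study_items"])


def assign_deepest_term(terms, genes):
    """Map each gene to the deepest enriched term that contains it."""
    assignment = {}
    # Visit terms from deepest to shallowest; the first hit for a gene wins.
    for term in sorted(terms, key=lambda t: t.depth, reverse=True):
        for gene in term.study_items:
            if gene in genes and gene not in assignment:
                assignment[gene] = term.name
    # Genes covered by no enriched term are labelled explicitly.
    for gene in genes:
        assignment.setdefault(gene, "Unassigned")
    return assignment


toy_terms = [
    Term("cell cycle", 2, {"g1", "g2", "g3"}),
    Term("cell division", 4, {"g1", "g2"}),
]
toy_assignment = assign_deepest_term(toy_terms, ["g1", "g2", "g3", "g4"])
print(toy_assignment)
```

Genes `g1` and `g2` receive the more specific label, `g3` falls back to the shallower term, and `g4` stays unassigned.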

In [None]:
fig = sc.pl.umap(
    paga_df,
    color=["leiden_lowres", "leiden", "leiden_highres", "leiden_ultrahighres"],
    title=[
        "Leiden Resolution=0.25",
        "Leiden Resolution=1.",
        "Leiden Resolution=1.5",
        "Leiden Resolution=3.",
    ],
    show=False,
    legend_loc="on data",
    edges=True,
    add_outline=False,
    size=50,
    return_fig=True,
    palette=vega_20_scanpy,
)

#### Large fts-like Cluster

In [None]:
all_genes = paga_df.obs["Gene"].unique().tolist()

clust_id = 0

clust_id = str(clust_id)
cluster_genes = sorted(
    paga_df.obs[paga_df.obs["leiden"] == clust_id]["Gene"].unique().tolist()
)

goea_quiet_enriched = get_enriched_GO_terms(
    all_genes, cluster_genes, obodag, objanno, ns2assoc, pval=0.05, GO_type="BP"
)
filtered_go_terms = get_filtered_go_terms(
    obodag, objanno, goea_quiet_enriched, sim_thr=0.3, info_thr=1.0
)
go_term_dict = {
    go_term.name: go_term.ratio_in_study[0] for go_term in filtered_go_terms
}
# ttl_terms = np.sum(list(go_term_dict.values()))
# go_term_dict = {key:val/ttl_terms for key,val in go_term_dict.items()}

print()
for key, value in go_term_dict.items():
    print(key, " : ", value)
print()
for i in range(0, len(cluster_genes), 5):
    print(cluster_genes[i : i + 5])

#### Large Width Cluster

In [None]:
all_genes = paga_df.obs["Gene"].unique().tolist()

clust_id = 26

clust_id = str(clust_id)
cluster_genes = sorted(
    paga_df.obs[paga_df.obs["leiden_ultrahighres"] == clust_id]["Gene"]
    .unique()
    .tolist()
)

goea_quiet_enriched = get_enriched_GO_terms(
    all_genes, cluster_genes, obodag, objanno, ns2assoc, pval=0.05, GO_type="BP"
)
filtered_go_terms = get_filtered_go_terms(
    obodag, objanno, goea_quiet_enriched, sim_thr=0.3, info_thr=1.0
)
go_term_dict = {
    go_term.name: go_term.ratio_in_study[0] for go_term in filtered_go_terms
}
# ttl_terms = np.sum(list(go_term_dict.values()))
# go_term_dict = {key:val/ttl_terms for key,val in go_term_dict.items()}

print()
for key, value in go_term_dict.items():
    print(key, " : ", value)
print()
for i in range(0, len(cluster_genes), 5):
    print(cluster_genes[i : i + 5])

In [None]:
all_genes = paga_df.obs["Gene"].unique().tolist()

clust_id = 22
freq_thr = 3

clust_id = str(clust_id)
unique_genes = np.unique(
    paga_df.obs[paga_df.obs["leiden_ultrahighres"] == clust_id]["Gene"],
    return_counts=True,
)
cluster_genes = sorted(unique_genes[0][unique_genes[1] > freq_thr].tolist())

goea_quiet_enriched = get_enriched_GO_terms(
    all_genes, cluster_genes, obodag, objanno, ns2assoc, pval=0.05, GO_type="BP"
)
filtered_go_terms = get_filtered_go_terms(
    obodag, objanno, goea_quiet_enriched, sim_thr=0.3, info_thr=1.0
)
go_term_dict = {
    go_term.name: go_term.ratio_in_study[0] for go_term in filtered_go_terms
}
# ttl_terms = np.sum(list(go_term_dict.values()))
# go_term_dict = {key:val/ttl_terms for key,val in go_term_dict.items()}

print()
for key, value in go_term_dict.items():
    print(key, " : ", value)
print()
for i in range(0, len(cluster_genes), 5):
    print(cluster_genes[i : i + 5])

In [None]:
cluster_genes

In [None]:
fig = highlight_gene_group(paga_df, cluster_genes)

#### Notes for later

- Need to think more about whether I am satisfied with this way of viewing clusters of related genes (with the soft-DTW metric)

- Do I want to get a sub-cluster view of these genes and their proximity?

- Do I want graph-based measurements of association?

    - How could I implement this, given that each gene has a varying number of sgRNAs?

- Should I filter by number of sgRNAs to declare a hit significant?

- Should I threshold by more observations to reduce noise?
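
One way to explore the sgRNA-count question is the per-gene frequency filter already tried for cluster 22 (`freq_thr`). The sketch below shows the idea on made-up (gene, cluster) pairs; `genes_with_support` and the data are hypothetical, and it uses `>=` where the cell above uses a strict `>`.

```python
from collections import Counter

# Made-up (gene, cluster) pairs, one per sgRNA.
toy_rows = [
    ("ftsZ", "22"), ("ftsZ", "22"), ("ftsZ", "22"), ("ftsZ", "22"),
    ("ftsA", "22"), ("ftsA", "22"),
    ("mreB", "5"), ("mreB", "22"),
]


def genes_with_support(rows, cluster, min_sgrnas):
    """Genes with at least `min_sgrnas` sgRNAs assigned to `cluster`."""
    counts = Counter(gene for gene, clust in rows if clust == cluster)
    return sorted(g for g, n in counts.items() if n >= min_sgrnas)


toy_supported = genes_with_support(toy_rows, cluster="22", min_sgrnas=2)
print(toy_supported)
```

A fixed count threshold ignores that genes differ in how many sgRNAs target them, so a fraction-based cut-off may be worth comparing against.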


### Output

In [None]:
paga_df_only = paga_df.obs
paga_df_only.to_pickle("./2021-12-07_paga_df_only.pkl")
paga_df.obs = paga_df.obs[
    paga_df.obs.dtypes[paga_df.obs.dtypes.isin([np.int64, float])].index.tolist()
]
paga_df.write("./2021-12-07_paga_df.h5ad")