## scGeneFit
Notebook implementing the scGeneFit method for finding informative marker genes. scGeneFit finds an N-dimensional projection of the data in which each dimension of the subspace aligns with a dimension of the original data, so each selected dimension corresponds to a single gene rather than a linear combination of many genes.

https://www.nature.com/articles/s41467-021-21453-4#Sec9
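
To make the "axis-aligned projection" idea concrete, here is a minimal sketch on a made-up expression matrix. The matrix values and the marker indices are assumptions for illustration only; in this notebook the indices would come from `gf.get_markers`. The point is that projecting onto single-gene axes is the same as keeping a subset of columns.

```python
import numpy as np

# Toy expression matrix: 4 cells x 5 genes (values are made up for illustration)
X = np.array([
    [1.0, 0.0, 3.0, 2.0, 5.0],
    [0.0, 2.0, 1.0, 4.0, 0.0],
    [2.0, 1.0, 0.0, 0.0, 1.0],
    [3.0, 3.0, 2.0, 1.0, 2.0],
])

# Hypothetical marker indices; in the notebook these would come from gf.get_markers
marker_idx = [1, 3]

# Axis-aligned projection matrix: one column per marker, each containing a single 1
P = np.zeros((X.shape[1], len(marker_idx)))
P[marker_idx, np.arange(len(marker_idx))] = 1.0

# Projecting with P gives exactly the marker columns of X
projected = X @ P
assert np.array_equal(projected, X[:, marker_idx])
```

A method such as PCA would instead fill `P` with dense weights, so each output dimension would mix many genes and could not be measured with a single probe.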


In [1]:
import scGeneFit.functions as gf
import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc
import scipy.io
import scipy.sparse
import matplotlib.pyplot as plt
import pickle
import json

np.random.seed(0)

## Cluster level gene panel

In [2]:
# Load data
gluData = sc.read("../Data/clusterData.h5ad")



In [3]:
# Avoid selecting from unsuitable genes: load the list of genes suspected to be unsuitable for MERSCOPE
with open("../Data/badGenes.json", "r") as f:
    badGenes = json.load(f)

# Remove those genes from the AnnData object, keeping the original gene order
# (a plain set difference would return the columns in arbitrary order)
badGeneSet = set(badGenes)
keepGenes = [gene for gene in gluData.var_names if gene not in badGeneSet]
gluData = gluData[:, keepGenes].copy()
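
Column order matters here because the marker indices returned later are mapped back to gene names through `gluData.var_names`. A set difference keeps the right genes but does not guarantee a stable order between Python sessions, which makes index-based results harder to compare across runs. A minimal sketch with made-up gene names (all names below are placeholders, not the real data):

```python
# Toy gene names (made up) standing in for gluData.var_names and the bad-gene list
var_names = ["Gad1", "Slc17a7", "Malat1", "Snap25", "Xist"]
bad_genes = ["Malat1", "Xist"]

# Set difference: correct membership, but iteration order is not guaranteed
keep_unordered = list(set(var_names) - set(bad_genes))

# List comprehension with a set lookup: same membership, original order kept
bad_set = set(bad_genes)
keep_ordered = [g for g in var_names if g not in bad_set]

# Both approaches select the same genes; only the ordering guarantee differs
assert sorted(keep_unordered) == sorted(keep_ordered)
print(keep_ordered)  # ['Gad1', 'Slc17a7', 'Snap25']
```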

In [4]:
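# scGeneFit takes a dense cells x genes array, so densify the sparse matrix
clusterData = gluData.X.toarray()
# Per-cell cluster labels: the classes the markers should separate
clusterLabel = gluData.obs["cluster_label"]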

In [5]:
# Really sloppy setup for an overnight run:
method = "centers"
redundancy = 0.2

num_genes = [350, 500]
scGeneFit_results = {}

for num in num_genes:
    print("Starting ", num, " marker run")
    scGeneFit_results[num] = gf.get_markers(
        clusterData, clusterLabel, num, method=method, redundancy=redundancy
    )

Starting  350  marker run
Solving a linear program with 26572 variables and 1000 constraints
Time elapsed: 31779.09667444229 seconds
Starting  500  marker run
Solving a linear program with 26572 variables and 1000 constraints
Time elapsed: 26955.354702949524 seconds
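
The two runs above took roughly 8.8 and 7.5 hours, so losing the in-memory results would be costly. One possible safeguard, sketched here under assumptions, is to write each result to disk as soon as its run finishes and to skip panel sizes that already have a checkpoint. `fake_get_markers`, `run_with_checkpoints`, and the checkpoint file names are hypothetical; the real call would be `gf.get_markers` with the arguments used above.

```python
import pickle
import tempfile
from pathlib import Path


def fake_get_markers(num):
    """Hypothetical stand-in for gf.get_markers; returns dummy marker indices."""
    return list(range(num))


def run_with_checkpoints(panel_sizes, out_dir, solver=fake_get_markers):
    """Run one panel size at a time, writing each result to disk as it finishes."""
    out_dir = Path(out_dir)
    results = {}
    for num in panel_sizes:
        checkpoint = out_dir / f"markers_{num}.pickle"
        if checkpoint.exists():
            # Reuse a finished run instead of solving again
            with open(checkpoint, "rb") as f:
                results[num] = pickle.load(f)
            continue
        results[num] = solver(num)
        with open(checkpoint, "wb") as f:
            pickle.dump(results[num], f)
    return results


with tempfile.TemporaryDirectory() as tmp:
    first = run_with_checkpoints([3, 5], tmp)
    # The second call finds both checkpoints, so the solver is never invoked
    second = run_with_checkpoints([3, 5], tmp, solver=None)
```

With this layout, an interrupted overnight run can be restarted and only the unfinished panel sizes are recomputed.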


In [20]:
# Convert the selected marker indices to gene names and save them
geneDict = {}
for key in scGeneFit_results.keys():
    geneDict[key] = list(gluData.var_names[scGeneFit_results[key]])

with open("../Data/scGeneFit_p11_p14_panel.pickle", 'wb') as f:
    pickle.dump(geneDict, f)
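
Before handing the panel on, it can be worth reloading the pickle and running a quick sanity check. The sketch below uses a made-up dictionary with the same shape as `geneDict` (panel size mapped to a list of gene names) and a temporary file with a hypothetical name, so it does not touch the real output file.

```python
import pickle
import tempfile
from pathlib import Path

# Made-up panel in the same shape as geneDict: panel size -> list of gene names
gene_dict = {2: ["Gad1", "Snap25"], 3: ["Gad1", "Snap25", "Slc17a7"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "panel.pickle"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(gene_dict, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Each panel should have the requested size and contain no duplicate genes
for size, genes in loaded.items():
    assert len(genes) == size
    assert len(set(genes)) == len(genes)
```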