# Excitatory Neuron Analysis
This notebook contains the analysis of the excitatory neurons in the dataset.

In [None]:
import scProject
import numpy as np
import scanpy as sc
patterns = sc.read_h5ad('data/patterns_anndata.h5ad')
dataset = sc.read_h5ad('data/test_target.h5ad')
dataset_filtered, patterns_filtered = scProject.matcher.filterAnnDatas(dataset, patterns, 'gene_id')

We start with a regularization weight of .0001 and only 1% lasso so that as many features as possible show up. Then, we will increase the lasso and the regularization to see what drops out.
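To make the role of the lasso fraction concrete, here is a minimal NumPy sketch of a non-negative elastic net. It is not scProject's implementation: the function `nn_elastic_net`, the solver (projected proximal gradient), the objective scaling, and the synthetic data are all assumptions chosen for illustration. The regularization weight is also larger than the .0001 used in this notebook so that the effect is visible on a tiny example.

```python
import numpy as np

def nn_elastic_net(X, y, alpha, l1_ratio, n_iter=5000):
    """Non-negative elastic net via projected proximal gradient (illustrative only).

    Assumed objective:
        1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1
        + 0.5 * alpha * (1 - l1_ratio) * ||w||^2,   subject to w >= 0
    """
    n, p = X.shape
    w = np.zeros(p)
    # Step size from the Lipschitz constant of the smooth part of the objective.
    lipschitz = np.linalg.norm(X, 2) ** 2 / n + alpha * (1 - l1_ratio)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n + alpha * (1 - l1_ratio) * w
        # Gradient step, L1 shrinkage, then projection onto w >= 0.
        w = np.maximum(0.0, w - step * (grad + alpha * l1_ratio))
    return w

rng = np.random.default_rng(0)
X = rng.random((200, 10))                  # hypothetical non-negative patterns
true_w = np.zeros(10)
true_w[[1, 4, 7]] = [2.0, 1.0, 0.5]        # only three patterns are truly used
y = X @ true_w + 0.05 * rng.standard_normal(200)

w_low = nn_elastic_net(X, y, alpha=0.05, l1_ratio=0.01)
w_high = nn_elastic_net(X, y, alpha=0.05, l1_ratio=0.75)
print("nonzero weights at  1% lasso:", np.count_nonzero(w_low))
print("nonzero weights at 75% lasso:", np.count_nonzero(w_high))
```

A larger lasso fraction puts more of the penalty on the L1 term, which is the term that sets weak weights exactly to zero. That is why raising it is a reasonable way to see which features drop out.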

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']= [10, 12]
scProject.rg.NNLR_ElasticNet(dataset_filtered, patterns_filtered, 'EN01', .0001, .01)
scProject.viz.pearsonMatrix(dataset_filtered, patterns_filtered, 'CellType', 12, 'EN01', 'PearsonEN01', True)

In [None]:
plt.rcParams['figure.figsize']= [12, 10]
scProject.viz.UMAP_Projection(dataset_filtered, 'CellType', 'EN01', 'UMAPEN01', 20)

Since there is so much heterogeneity in excitatory neurons, a feature could have an unimpressive Pearson value but still be a driver of one cluster of excitatory neurons. These features are valuable in understanding the sub types of excitatory neurons. While it is inconvenient to look through all of the plots, it is the easiest way to find the interesting ones. For brevity, I condensed the 80 plots down to the interesting ones for this notebook. That is still a lot of features, so I am going to regress again with a higher lasso to see which ones drop out. Dropping out does not mean that a feature is unimportant, only that it is not among the strongest drivers.
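The point about unimpressive Pearson values can be checked on a toy example. The sketch below assumes the correlation is taken between a feature's weights and a 0/1 cell type indicator; the cell counts and the perfectly binary feature are made up for illustration and are not taken from this dataset.

```python
import numpy as np

n_cells = 1000
is_excitatory = np.zeros(n_cells)
is_excitatory[:300] = 1.0            # hypothetical: 300 excitatory neurons
is_subtype = np.zeros(n_cells)
is_subtype[:60] = 1.0                # hypothetical sub type: 60 of those cells

# A feature that is used only by the sub type.
feature_weights = is_subtype.copy()

r_cell_type = np.corrcoef(feature_weights, is_excitatory)[0, 1]
r_subtype = np.corrcoef(feature_weights, is_subtype)[0, 1]
print(f"Pearson vs. all excitatory neurons: {r_cell_type:.2f}")
print(f"Pearson vs. the sub type:           {r_subtype:.2f}")
```

Even though this feature marks its sub type perfectly, its correlation with the whole cell type is only about 0.39, so ranking by Pearson value alone would undersell it.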

In [None]:
scProject.viz.featurePlots(dataset_filtered, [15, 19, 36, 53, 63, 65, 67], 'EN01', 'UMAPEN01')

While the other features light up some sub types of excitatory neurons, feature 36 looks really interesting because it only lights up half of each cluster it is expressed in. Next, I am going to raise the lasso to 75% to see which features are the strongest drivers.

In [None]:
plt.rcParams['figure.figsize']= [10, 12]
scProject.rg.NNLR_ElasticNet(dataset_filtered, patterns_filtered, 'EN60', .0001, .75)
scProject.viz.pearsonMatrix(dataset_filtered, patterns_filtered, 'CellType', 12, 'EN60', 'PearsonEN60', True, row_cluster=False, col_cluster=False)

In [None]:
plt.rcParams['figure.figsize']= [12, 10]
scProject.viz.UMAP_Projection(dataset_filtered, 'CellType', 'EN60', 'UMAPEN60', 20)

Because of how scProject is set up, we can reuse the UMAP coordinates from another regression and overlay new feature weights on them. This lets the user see how the weights changed on the same UMAP coordinates. Feature 20 goes to 0:

In [None]:
scProject.viz.featurePlots(dataset_filtered, [15, 19, 36, 63, 65], 'EN60', 'UMAPEN60')
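Besides comparing plots by eye, the features that dropped out can be listed directly from the two weight matrices. This is a sketch on small made-up arrays standing in for `dataset_filtered.obsm['EN01']` and `dataset_filtered.obsm['EN60']`; the helper `dropped_features` is hypothetical, and it assumes features are numbered from 1 (so column 35 is feature 36).

```python
import numpy as np

# Hypothetical stand-ins for the 'EN01' and 'EN60' weight matrices (cells x features).
weights_en01 = np.array([[0.5, 0.0, 0.2],
                         [0.1, 0.3, 0.0],
                         [0.0, 0.4, 0.1]])
weights_en60 = np.array([[0.4, 0.0, 0.0],
                         [0.1, 0.2, 0.0],
                         [0.0, 0.3, 0.0]])

def dropped_features(low_lasso, high_lasso):
    """Return 1-based numbers of features used at low lasso but all-zero at high lasso."""
    used_before = (low_lasso > 0).any(axis=0)
    used_after = (high_lasso > 0).any(axis=0)
    return np.flatnonzero(used_before & ~used_after) + 1

print(dropped_features(weights_en01, weights_en60))
```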

In [None]:
print(scProject.stats.importantGenes(patterns_filtered, 36, .1))
print(scProject.stats.importantGenes(patterns_filtered, 39, .1))
print(scProject.stats.importantGenes(patterns_filtered, 58, .1))
print(scProject.stats.importantGenes(patterns_filtered, 19, .1))

These are Ensembl IDs, so I used https://www.syngoportal.org/convert.html to convert them to gene names.
For Feature 36, the highest expressed gene by far is inactive X specific transcripts (Xist). This dataset is 50% male and 50% female, so it makes sense that feature 36 lights up half of each cluster (the female cells).

Feature 39:
ENSMUSG00000041329	ATPase, Na+/K+ transporting, beta 2 polypeptide
ENSMUSG00000001270	creatine kinase, brain
ENSMUSG00000019874	fatty acid binding protein 7, brain
ENSMUSG00000052727	microtubule-associated protein 1B
ENSMUSG00000021268	maternally expressed 3

Feature 58:
ENSMUSG00000021939	cathepsin B 
This feature is very sparse, and cathepsin B is really high.


Let's visualize the expression of the Xist gene.

In [None]:
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000086503', 36, True)

Here we confirm that feature 36 is one of the largest "expressors" of Xist in the retina patterns. Since feature 36 is not correlated with a specific cell type, we chose not to use the gene drivers method.

In [None]:
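exc = dataset_filtered[dataset_filtered.obs['CellType'].isin(['Excitatory Neurons'])]
print(exc.shape)
# Column 35 holds feature 36 (features are numbered from 1).
# Copy the subsets so that writing to .X does not act on a view of the parent object.
E1 = exc[exc.obsm['EN01'][:, 35] > 0].copy()   # cells that use feature 36
E2 = exc[exc.obsm['EN01'][:, 35] == 0].copy()  # cells that do not
E1.X = np.log1p(E1.X)
E2.X = np.log1p(E2.X)
print(E1.X.shape)
print(E2.X.shape)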

In [None]:
scProject.stats.HotellingT2(E1, E2)
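For readers unfamiliar with the test, here is a minimal NumPy sketch of the textbook two-sample Hotelling T² statistic with a pooled covariance. It is not scProject's code: the function `hotelling_t2` and the synthetic groups are assumptions, and the sketch only works when the pooled covariance is invertible (more cells than variables), which may not hold for a full gene matrix.

```python
import numpy as np

def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 statistic and its F-scaled value (pooled covariance)."""
    n1, p = a.shape
    n2 = b.shape[0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    pooled = ((n1 - 1) * cov_a + (n2 - 1) * cov_b) / (n1 + n2 - 2)
    # T^2 = n1*n2/(n1+n2) * diff^T * pooled^{-1} * diff
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(pooled, diff)
    # Under the null, this scaled value follows F(p, n1 + n2 - p - 1).
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat

rng = np.random.default_rng(0)
group_1 = rng.normal(size=(40, 3))
group_2 = rng.normal(size=(50, 3)) + np.array([1.0, 0.0, 0.0])  # shifted mean
t2, f_stat = hotelling_t2(group_1, group_2)
print(f"T^2 = {t2:.2f}, F = {f_stat:.2f}")
```

A large T² says the two mean vectors differ somewhere, but not in which variables, which is why a per-gene comparison is a natural follow-up.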

In [None]:
df = scProject.stats.BonferroniCorrectedDifferenceMeans(E1, E2, .9999999999999, 'gene_short_name')

In [None]:
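# Keep genes whose High and Low values share the same sign.
filt = ((df['High'] > 0) & (df['Low'] > 0)) | ((df['High'] < 0) & (df['Low'] < 0))
df = df[filt].copy()  # copy so that adding a column does not write to a slice
df['diff'] = df['High'].sub(df['Low'], axis=0)
df.sort_values('diff')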
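The sign filter above keeps genes whose `High` and `Low` values do not straddle zero. Assuming those columns are the bounds of a simultaneous confidence interval for the difference in means (an assumption about the output, not something confirmed here), the idea can be sketched with a normal approximation. The function `bonferroni_mean_diff`, the default `alpha`, and the synthetic data are all hypothetical.

```python
import numpy as np
from statistics import NormalDist

def bonferroni_mean_diff(a, b, alpha=0.05):
    """Simultaneous normal-approximation intervals for per-gene mean differences."""
    n_genes = a.shape[1]
    # Bonferroni: split the total error rate alpha evenly across all genes.
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_genes))
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / a.shape[0]
                 + b.var(axis=0, ddof=1) / b.shape[0])
    return diff - z * se, diff + z * se

rng = np.random.default_rng(0)
group_b = rng.normal(size=(200, 3))
group_a = group_b.copy()
group_a[:, 0] += 5.0      # only the first hypothetical gene differs between groups

low, high = bonferroni_mean_diff(group_a, group_b)
excludes_zero = ((low > 0) & (high > 0)) | ((low < 0) & (high < 0))
print(excludes_zero)
```

In this sketch the interval width `high - low` equals `2 * z * se`, so sorting by it ranks genes by how precisely their difference is estimated rather than by how large the difference is.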

# This difference shows that the Xist gene has the lowest variance of the genes.