# Feature Extraction Analysis for Microglia in P6 Mouse Cortex
This notebook gives a brief guide to how the dataset and patterns are loaded, projected, and analyzed with scProject.

In [None]:
import random
random.seed(a=613)
import numpy as np
import scProject
import scanpy as sc
patterns = sc.read_h5ad('data/patterns_anndata.h5ad')
dataset = sc.read_h5ad('data/p6counts.h5ad')
dataset_filtered, patterns_filtered = scProject.matcher.filterAnnDatas(dataset, patterns, 'id')
print(dataset.shape)

First, we apply low regularization with a 1% lasso (L1) ratio, as shown below:

In [None]:
import matplotlib.pyplot as plt
from sklearn import linear_model
plt.rcParams['figure.figsize']= [10, 12]
dataset_filtered = scProject.matcher.logTransform(dataset_filtered)
scProject.rg.NNLR_ElasticNet(dataset_filtered, patterns_filtered, 'MG01', .0001, .01, layer='log', iterations=100000)
scProject.viz.pearsonMatrix(dataset_filtered, patterns_filtered, 'assigned_cell_type', 11, 'MG01', 'MG01Pears', True, display=False, path='MG/PearsonMicrgliaLowReg.pdf')
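
For intuition about what a Pearson matrix of this kind summarizes, the sketch below correlates each feature's per-cell weight with a 0/1 indicator of cell-type membership. It assumes that this is the quantity of interest; the exact computation inside `scProject.viz.pearsonMatrix` is not shown in this notebook, and the function name `pearson_by_cell_type`, the array names, and the toy values are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 6 cells (rows) and 2 features (columns)
weights = np.array([
    [0.9, 0.1],
    [0.8, 0.0],
    [0.7, 0.2],
    [0.1, 0.3],
    [0.0, 0.1],
    [0.2, 0.2],
])
cell_types = np.array(["Microglia"] * 3 + ["Neuron"] * 3)

def pearson_by_cell_type(weights, cell_types):
    """Return {cell_type: array of Pearson r values, one per feature}."""
    result = {}
    for ct in np.unique(cell_types):
        # 1.0 for cells of this type, 0.0 for all other cells
        indicator = (cell_types == ct).astype(float)
        result[ct] = np.array([
            np.corrcoef(weights[:, k], indicator)[0, 1]
            for k in range(weights.shape[1])
        ])
    return result

corr = pearson_by_cell_type(weights, cell_types)
print("Microglia:", corr["Microglia"])
print("Neuron:", corr["Neuron"])
```

A feature whose weight is high mostly in one cell type gets a correlation near 1 for that type, which is why strong markers stand out in such a matrix.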

In [None]:
plt.rcParams['figure.figsize']= [12, 10]
scProject.viz.UMAP_Projection(dataset_filtered, 'assigned_cell_type', 'MG01', 'UMAPMG01', 20, display=False, path='MG/UMAPlowReg.pdf')

In [None]:
scProject.viz.featurePlots(dataset_filtered, [24,5,6,25,57,58] , 'MG01', 'UMAPMG01', display=False, path='MG/MGLowReg')

As expected from the Pearson plot, features 5 and 24 are the stronger markers of microglia in the P6 mouse cortex. Let's increase the lasso ratio to encourage sparsity and strengthen the features that are the real drivers.
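
To make the role of the lasso term concrete, here is a minimal sketch of non-negative elastic net regression by coordinate descent. It is not scProject's implementation. The function name `nn_elastic_net`, the synthetic data, and the reading of the two numeric arguments to `NNLR_ElasticNet` as an overall penalty `alpha` and an L1 mixing ratio `l1_ratio` (as in scikit-learn's `ElasticNet`) are assumptions made for illustration.

```python
import numpy as np

def nn_elastic_net(X, y, alpha, l1_ratio, n_iter=1000):
    """Minimize (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2 subject to w >= 0 (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-feature curvature
    residual = y.astype(float)             # equals y - X @ w while w is zero
    for _ in range(n_iter):
        for j in range(p):
            # fit of feature j to the residual with feature j's own part added back
            rho = X[:, j] @ residual / n + col_sq[j] * w[j]
            # soft-threshold by the L1 penalty, clip at zero, shrink by the L2 penalty
            new_wj = max(0.0, rho - alpha * l1_ratio) / (col_sq[j] + alpha * (1 - l1_ratio))
            residual += X[:, j] * (w[j] - new_wj)
            w[j] = new_wj
    return w

# Synthetic stand-in for a patterns matrix: 200 genes x 30 features
rng = np.random.default_rng(613)
X = rng.random((200, 30))
w_true = np.zeros(30)
w_true[5], w_true[24] = 2.0, 1.0           # only two features truly drive the sample
y = X @ w_true + 0.01 * rng.standard_normal(200)

low = nn_elastic_net(X, y, alpha=1e-4, l1_ratio=0.01)   # mirrors the low-lasso call
high = nn_elastic_net(X, y, alpha=5e-4, l1_ratio=0.99)  # mirrors the high-lasso call
print("nonzero weights, low lasso: ", np.count_nonzero(low))
print("nonzero weights, high lasso:", np.count_nonzero(high))
print("largest weight, low lasso: feature", int(np.argmax(low)))
```

The L1 part subtracts a constant from every coefficient before clipping at zero, so features with weak support are pushed to exactly zero, while the L2 part only shrinks them.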

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']= [10, 12]
scProject.rg.NNLR_ElasticNet(dataset_filtered, patterns_filtered, 'MG99', .0005, .99, layer='log')
scProject.viz.pearsonMatrix(dataset_filtered, patterns_filtered, 'assigned_cell_type', 11, 'MG99', 'MG99Pears', True,display=False,row_cluster=False, col_cluster=False,path='MG/MicrogliaPearsonHighReg.pdf')

In [None]:
plt.rcParams['figure.figsize']= [12, 10]
scProject.viz.UMAP_Projection(dataset_filtered, 'assigned_cell_type', 'MG99', 'UMAPMG99', 20,display=False, path='MG/UMAPMGHighReg.pdf')

One nice feature of scProject is that you can reuse the UMAP coordinates generated from a previous regression. This is useful when you increase the regularization and the UMAPs become harder to decipher. The feature plots below show the high-regularization weights on the original UMAP coordinates.

In [None]:
scProject.viz.featurePlots(dataset_filtered, [24,5,6,25,57,58], 'MG99', 'UMAPMG01',display=False, path='MG/MGHighReg')

These are the same feature weights plotted on the newly generated UMAP coordinates.

In [None]:
scProject.viz.featurePlots(dataset_filtered, [24,5,6,25,57,58], 'MG99', 'UMAPMG99', obsColumn='assigned_cell_type',display=False,path='MG/f5Micro')

While this is clearly over-regularized, some features, namely 5, 6, and 75 (judging by how many cells have nonzero weights), persist. Let's print the highest-expressed genes from the features of interest and see what's inside.

In [None]:
print(scProject.stats.importantGenes(patterns_filtered, 5, .05), "Feature 5 Genes")
print(scProject.stats.importantGenes(patterns_filtered, 6, .05), "Feature 6 Genes")
print(scProject.stats.importantGenes(patterns_filtered, 24, .1), "Feature 24 Genes")
print(scProject.stats.importantGenes(patterns_filtered, 57, .01), "Feature 57 Genes")

In short, feature 24, which is expressed in only a fraction of microglia, has high expression of the C1qa, C1qb, and C1qc genes. Because feature 24 is not used by all microglia, this suggests that a subtype of microglia in the P6 mouse cortex expresses C1qa, C1qb, and C1qc at much higher levels. Features 5, 6, and 75 do not have the C1q genes among their most important genes. Together, this points to microglial subtypes distinguished by C1q expression.

To better understand the expression of C1qa, C1qb, and C1qc, let's use gene selectivity to see what the model is doing.

In [None]:
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036887', 5, False) #C1qa
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036887', 6, False) #C1qa
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036887', 24, True) #C1qa

scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036905', 5, False) #C1qb
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036905', 6, False) #C1qb
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036905', 24, True) #C1qb

scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036896', 5, False) #C1qc
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036896', 6, False) #C1qc
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000036896', 24, True) #C1qc

scProject.stats.geneDriver(dataset_filtered, patterns_filtered, 'ENSMUSG00000036887', 'assigned_cell_type',
                                       "Microglia", "MG99")
scProject.stats.geneDriver(dataset_filtered, patterns_filtered, 'ENSMUSG00000036905', 'assigned_cell_type',
                                       "Microglia", "MG99")
scProject.stats.geneDriver(dataset_filtered, patterns_filtered, 'ENSMUSG00000036896', 'assigned_cell_type',
                                       "Microglia", "MG99")

In [None]:
scProject.stats.geneSelectivity(patterns_filtered, 'ENSMUSG00000059498', 24, True)

The model chose the best features to use in the samples. As we saw in the previous plots, feature 5 is the largest driver of microglia. Note that these plots average over all cells annotated as microglia, but from the feature plots we know that some microglia express more of feature 6 and others more of feature 24. The first plots clearly show that feature 24 expresses much more of all three genes. Next, we use the statistics in scProject to see which genes drive the difference between subtypes. Below, I split the microglia into two groups based on their expression of feature 6.
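
Because an average over all microglia can mask subpopulations, one simple way to look for them is to split the cells on a feature's weight. The sketch below is only an illustration with made-up weights and a hypothetical threshold of 0.5; it does not reproduce the grouping used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical per-cell weights for one feature across eight microglia
feature_weights = np.array([0.0, 0.05, 0.9, 1.1, 0.02, 0.8, 0.0, 1.0])

threshold = 0.5  # assumed cut-off; in practice, read it off the weight histogram
high_mask = feature_weights > threshold

high_group = feature_weights[high_mask]
low_group = feature_weights[~high_mask]

print("overall mean:", feature_weights.mean())
print("high group mean:", high_group.mean(), "n =", high_group.size)
print("low group mean:", low_group.mean(), "n =", low_group.size)
```

The overall mean sits between the two group means and describes neither group well, which is the reason to inspect the weight distribution before comparing subtypes.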

In [None]:
# histograms of the weights of features 24, 25, 57, and 58 in microglia
plt.rcParams['figure.figsize']= [12, 10]
scProject.viz.patternWeightDistribution(dataset_filtered, 'MG01', [24,25, 57,58], obsColumn='assigned_cell_type', subset=['Microglia'], numBins=100)

In [None]:
import numpy as np
microglia= dataset_filtered[dataset_filtered.obs['assigned_cell_type'].isin(['Microglia'])].copy()
others= dataset_filtered.obs['assigned_cell_type'].unique().remove_categories('Microglia')
rest = dataset_filtered[dataset_filtered.obs['assigned_cell_type'].isin(list(others))].copy()
print(microglia.shape, rest.shape, dataset_filtered.shape)

microglia.X = np.log2(microglia.X + 1e-30) # log transform for statistical tests
rest.X = np.log2(rest.X + 1e-30) # log transform for statistical tests

plt.rcParams['figure.figsize'] = [5,50]
df24 = scProject.stats.projectionDriver(patterns_filtered, microglia, rest,.999999999999,'gene_short_name', 24, display=False, path='MG/Micro24Driver.pdf')

f24CIs = df24[0]

f24CIs['rank'] = (f24CIs['High']+f24CIs['Low'])/2.0
f24CIsRank = f24CIs.sort_values(by='rank', ascending=True)
counter = 0
yAxis = []
plt.rcParams['figure.figsize']= [4, 50]
for idx,low, high,y in zip(list(f24CIsRank.index) ,f24CIsRank['Low'], f24CIsRank['High'], range(len(f24CIsRank))):
    plt.plot((low, high), (counter, counter), '-', color='blue')
    if counter == 0:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue', label='Mean')
    else:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue')
    yAxis.append(idx)
    counter+=1

plt.yticks(range(len(yAxis)), yAxis)
plt.title("Microglia Feature 24 Ranked")
plt.plot((0,0), (0,len(yAxis)), '--', color='black')
plt.ylim(top= len(yAxis)+1)
plt.ylim(bottom=-1)
plt.legend()

import os
# Directory path and filename
directory = "MG"
filename = "Microglia24Ranked.pdf"

# Create the directory if it doesn't exist
if not os.path.exists(directory):
    os.makedirs(directory)

# Combine directory and filename to create the full file path
file_path = os.path.join(directory, filename)

# Save the figure
plt.savefig(file_path, bbox_inches='tight')
plt.show()
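
The ranking step in the cell above can also be written as a small reusable function. The sketch below uses plain Python and made-up intervals. The gene names and numbers are hypothetical, and treating an interval that excludes zero as the interesting case is an assumption about how the plot is read, suggested by the dashed line at zero.

```python
def rank_by_midpoint(intervals):
    """Sort {gene: (low, high)} by the interval midpoint, ascending."""
    rows = (
        (gene, (low + high) / 2.0, low, high)
        for gene, (low, high) in intervals.items()
    )
    return sorted(rows, key=lambda row: row[1])

def excludes_zero(low, high):
    """True when the whole interval lies on one side of zero."""
    return low > 0 or high < 0

# Hypothetical confidence intervals, for illustration only
toy_cis = {"GeneA": (0.5, 1.5), "GeneB": (-2.0, -1.0), "GeneC": (-0.2, 0.4)}

ranked = rank_by_midpoint(toy_cis)
for gene, mid, low, high in ranked:
    print(f"{gene}: midpoint={mid:+.2f}, excludes zero={excludes_zero(low, high)}")
```

Keeping the ranking separate from the plotting makes it easier to reuse the same ordering for each feature.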

In [None]:
plt.rcParams['figure.figsize']= [5,50]
df57 = scProject.stats.projectionDriver(patterns_filtered, microglia, rest,.999999999999,'gene_short_name', 57, display=True, path='MG/Micro57Driver.pdf')

f57CIs = df57[0]

f57CIs['rank'] = (f57CIs['High']+f57CIs['Low'])/2.0
f57CIsRank = f57CIs.sort_values(by='rank', ascending=True)
counter = 0
yAxis = []
plt.rcParams['figure.figsize']= [4, 55]
for idx,low, high,y in zip(list(f57CIsRank.index) ,f57CIsRank['Low'], f57CIsRank['High'], range(len(f57CIsRank))):
    plt.plot((low, high), (counter, counter), '-', color='blue')
    if counter == 0:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue', label='Mean')
    else:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue')
    yAxis.append(idx)
    counter+=1

plt.yticks(range(len(yAxis)), yAxis)
plt.title("Microglia Feature 57 Ranked")
plt.plot((0,0), (0,len(yAxis)), '--', color='black')
plt.ylim(top= len(yAxis)+1)
plt.ylim(bottom=-1)
plt.legend()

import os
# Directory path and filename
directory = "MG"
filename = "Microglia57Ranked.pdf"

# Create the directory if it doesn't exist
if not os.path.exists(directory):
    os.makedirs(directory)

# Combine directory and filename to create the full file path
file_path = os.path.join(directory, filename)

plt.savefig(file_path, bbox_inches='tight')
plt.show() # Blank plot

In [15]:
genes24 = set(df24[0].index)
genes57 = set(df57[0].index)

In [None]:
in24 = genes24.difference(genes57)
just24CIs = df24[0].loc[list(in24)]

just24CIs['rank'] = (just24CIs['High']+just24CIs['Low'])/2.0
just24CIsRank = just24CIs.sort_values(by='rank', ascending=True)
counter = 0
yAxis = []
plt.rcParams['figure.figsize']= [4, 8]
for idx,low, high,y in zip(list(just24CIsRank.index) ,just24CIsRank['Low'], just24CIsRank['High'], range(len(just24CIsRank))):
    plt.plot((low, high), (counter, counter), '-', color='blue')
    if counter == 0:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue', label='Mean')
    else:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue')
    yAxis.append(idx)
    counter+=1

plt.yticks(range(len(yAxis)), yAxis)
plt.title("Microglia Feature 24 Exclusive Ranked")
plt.plot((0,0), (0,len(yAxis)), '--', color='black')
plt.ylim(top= len(yAxis)+1)
plt.ylim(bottom=-1)
plt.legend()

import os
# Directory path and filename
directory = "MG"
filename = "JustMicroglia24.pdf"

# Create the directory if it doesn't exist
if not os.path.exists(directory):
    os.makedirs(directory)

# Combine directory and filename to create the full file path
file_path = os.path.join(directory, filename)

plt.savefig(file_path, bbox_inches='tight')
plt.show()

In [None]:
in57 = genes57.difference(genes24)
just57CIs = df57[0].loc[list(in57)]

just57CIs['rank'] = (just57CIs['High']+just57CIs['Low'])/2.0
just57CIsRank = just57CIs.sort_values(by='rank', ascending=True)
counter = 0
yAxis = []
plt.rcParams['figure.figsize']= [4, 30]
for idx,low, high,y in zip(list(just57CIsRank.index) ,just57CIsRank['Low'], just57CIsRank['High'], range(len(just57CIsRank))):
    plt.plot((low, high), (counter, counter), '-', color='blue')
    if counter == 0:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue', label='Mean')
    else:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue')
    yAxis.append(idx)
    counter+=1

plt.yticks(range(len(yAxis)), yAxis)
plt.title("Microglia Feature 57 Exclusive Ranked")
plt.plot((0,0), (0,len(yAxis)), '--', color='black')
plt.ylim(top= len(yAxis)+1)
plt.ylim(bottom=-1)
plt.legend()

import os
# Directory path and filename
directory = "MG"
filename = "JustMicroglia57.pdf"

# Create the directory if it doesn't exist
if not os.path.exists(directory):
    os.makedirs(directory)

# Combine directory and filename to create the full file path
file_path = os.path.join(directory, filename)

plt.savefig(file_path, bbox_inches='tight')
plt.show()

In [None]:
import numpy as np
microglia= dataset_filtered[dataset_filtered.obs['assigned_cell_type'].isin(['Microglia'])].copy()
others= dataset_filtered.obs['assigned_cell_type'].unique().remove_categories('Microglia')
rest = dataset_filtered[dataset_filtered.obs['assigned_cell_type'].isin(list(others))].copy()
print(microglia.shape, rest.shape, dataset_filtered.shape)

microglia.X = np.log2(microglia.X + 1e-30) # log transform for statistical tests
rest.X = np.log2(rest.X + 1e-30) # log transform for statistical tests

plt.rcParams['figure.figsize']= [5,50]

df58 = scProject.stats.projectionDriver(patterns_filtered, microglia, rest,.999999999999,'gene_short_name', 58, display=True, path='MG/Micro58Driver.pdf')
df25 = scProject.stats.projectionDriver(patterns_filtered, microglia, rest,.999999999999,'gene_short_name', 25, display=True, path='MG/Micro25Driver.pdf')
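
One detail of the transform used above: adding `1e-30` before `np.log2` keeps the result finite for zero entries, but it maps every zero to roughly -99.66 rather than to 0, unlike the more common `log2(x + 1)`. The short check below only illustrates that arithmetic; whether `projectionDriver` expects this particular offset is not stated in this notebook.

```python
import math

offset = 1e-30
for value in (0.0, 1.0, 8.0):
    tiny = math.log2(value + offset)   # offset used in the cell above
    plus_one = math.log2(value + 1.0)  # common alternative, shown for comparison
    print(f"x={value}: log2(x + 1e-30)={tiny:.2f}, log2(x + 1)={plus_one:.2f}")

zero_value = math.log2(0.0 + offset)   # where a zero count ends up
```

Nonzero values are essentially unchanged by the tiny offset, so the main effect is a large gap between zero and nonzero entries.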

In [19]:
genes24 = set(df24[0].index)
genes57 = set(df57[0].index)
genes58 = set(df58[0].index)
genes25 = set(df25[0].index)
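
The exclusive-gene sets in this notebook are built with set differences. A minimal sketch of the same idea for any number of features, using made-up gene names (the helper `exclusive_to` is hypothetical and not part of scProject):

```python
def exclusive_to(target, gene_sets):
    """Genes found in gene_sets[target] and in no other set."""
    others = set().union(
        *(genes for name, genes in gene_sets.items() if name != target)
    )
    return gene_sets[target] - others

# Made-up gene names, for illustration only
toy_sets = {
    24: {"GeneA", "GeneB", "GeneC"},
    25: {"GeneB", "GeneD"},
    57: {"GeneC", "GeneE"},
    58: {"GeneE", "GeneF"},
}

for feature in toy_sets:
    print(feature, sorted(exclusive_to(feature, toy_sets)))
```

Subtracting the union of the other sets gives the same result as chaining `.difference(...)` once per set, and it does not need to be rewritten when another feature is added.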

In [None]:
in25 = genes25.difference(genes57).difference(genes58).difference(genes24)
just25CIs = df25[0].loc[list(in25)]

just25CIs['rank'] = (just25CIs['High']+just25CIs['Low'])/2.0
just25CIsRank = just25CIs.sort_values(by='rank', ascending=True)
counter = 0
yAxis = []
plt.rcParams['figure.figsize']= [4, 8]
for idx,low, high,y in zip(list(just25CIsRank.index) ,just25CIsRank['Low'], just25CIsRank['High'], range(len(just25CIsRank))):
    plt.plot((low, high), (counter, counter), '-', color='blue')
    if counter == 0:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue', label='Mean')
    else:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue')
    yAxis.append(idx)
    counter+=1

plt.yticks(range(len(yAxis)), yAxis)
plt.title("Microglia Feature 25 Exclusive Ranked")
plt.plot((0,0), (0,len(yAxis)), '--', color='black')
plt.ylim(top= len(yAxis)+1)
plt.ylim(bottom=-1)
plt.legend()

import os
# Directory path and filename
directory = "MG"
filename = "25MGvsAll-F57F58F24.pdf"

# Create the directory if it doesn't exist
if not os.path.exists(directory):
    os.makedirs(directory)

# Combine directory and filename to create the full file path
file_path = os.path.join(directory, filename)

plt.savefig(file_path, bbox_inches='tight')
plt.show()

In [None]:
in57 = genes57.difference(genes24).difference(genes58).difference(genes25)
just57CIs = df57[0].loc[list(in57)]

just57CIs['rank'] = (just57CIs['High']+just57CIs['Low'])/2.0
just57CIsRank = just57CIs.sort_values(by='rank', ascending=True)
counter = 0
yAxis = []
plt.rcParams['figure.figsize']= [4, 8]
for idx,low, high,y in zip(list(just57CIsRank.index) ,just57CIsRank['Low'], just57CIsRank['High'], range(len(just57CIsRank))):
    plt.plot((low, high), (counter, counter), '-', color='blue')
    if counter == 0:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue', label='Mean')
    else:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue')
    yAxis.append(idx)
    counter+=1

plt.yticks(range(len(yAxis)), yAxis)
plt.title("Microglia Feature 57 Exclusive Ranked")
plt.plot((0,0), (0,len(yAxis)), '--', color='black')
plt.ylim(top= len(yAxis)+1)
plt.ylim(bottom=-1)
plt.legend()

import os
# Directory path and filename
directory = "MG"
filename = "57MGvsAll-F25F58F24.pdf"

# Create the directory if it doesn't exist
if not os.path.exists(directory):
    os.makedirs(directory)

# Combine directory and filename to create the full file path
file_path = os.path.join(directory, filename)

plt.savefig(file_path, bbox_inches='tight')
plt.show()

In [None]:
in58 = genes58.difference(genes57).difference(genes25).difference(genes24)
just58CIs = df58[0].loc[list(in58)]

just58CIs['rank'] = (just58CIs['High']+just58CIs['Low'])/2.0
just58CIsRank = just58CIs.sort_values(by='rank', ascending=True)
counter = 0
yAxis = []
plt.rcParams['figure.figsize']= [4, 8]
for idx,low, high,y in zip(list(just58CIsRank.index) ,just58CIsRank['Low'], just58CIsRank['High'], range(len(just58CIsRank))):
    plt.plot((low, high), (counter, counter), '-', color='blue')
    if counter == 0:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue', label='Mean')
    else:
        plt.plot((float(low+high)/2.0), counter,'o', color='blue')
    yAxis.append(idx)
    counter+=1

plt.yticks(range(len(yAxis)), yAxis)
plt.title("Microglia Feature 58 Exclusive Ranked")
plt.plot((0,0), (0,len(yAxis)), '--', color='black')
plt.ylim(top= len(yAxis)+1)
plt.ylim(bottom=-1)
plt.legend()

import os
# Directory path and filename
directory = "MG"
filename = "58MGvsAll-F57F25F24.pdf"

# Create the directory if it doesn't exist
if not os.path.exists(directory):
    os.makedirs(directory)

# Combine directory and filename to create the full file path
file_path = os.path.join(directory, filename)

plt.savefig(file_path, bbox_inches='tight')
plt.show()