# Building Initial Models

My goal for this notebook is to understand how much signal can be extracted from the genes most correlated with each protein's presence, and whether linear models are an appropriate tool for this feature space. The problem statement calls for a separate model for each of 140 continuous targets, namely the proteins whose presence was recorded. These initial models should also give a sense of which proteins may be more challenging to model.

## Imports

In [1]:
import library as lb

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

## Load Data

In [2]:
X_train = pd.read_hdf('./data/train_test_split/X_train_cite_seq.h5')
X_test = pd.read_hdf('./data/train_test_split/X_test_cite_seq.h5')
Y_train = pd.read_hdf('./data/train_test_split/Y_train_cite_seq.h5')
Y_test = pd.read_hdf('./data/train_test_split/Y_test_cite_seq.h5')

In [3]:
Y_train.drop(columns = 'to_stratify', inplace = True)
Y_test.drop(columns = 'to_stratify', inplace = True)
# Created during train-test split, not relevant to modeling

In [4]:
corrs = pd.read_csv('./data/train_test_split/cite_seq_train_protein_gene_corrs.csv')

### Reduce the number of cells considered

I will only consider the latest day, day four, for these models in order to tune them. I will consider all the data in a Google Colab Notebook.

In [5]:
train_mask = Y_train['day'] == 4
test_mask = Y_test['day'] == 4

In [6]:
X_train = X_train[train_mask]
X_test = X_test[test_mask]
Y_train = Y_train[train_mask]
Y_test = Y_test[test_mask]

In [7]:
measure_of_all_data = X_train.shape[0] + X_test.shape[0]
X_train.shape[0] / measure_of_all_data, X_test.shape[0] / measure_of_all_data

(0.8, 0.2)

The train-test split was stratified on day, so within day 4 the distribution between train and test is still 80/20.
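The effect of stratification on a filtered subset can be checked on synthetic data. This is a minimal sketch, not the project's actual split: the `meta` frame, its day proportions, and the random seeds are made up for illustration, and it assumes the stratification label includes the day.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real metadata: 1,000 cells spread over three days.
rng = np.random.default_rng(0)
meta = pd.DataFrame({'day': rng.choice([2, 3, 4], size=1000, p=[0.3, 0.3, 0.4])})

# Stratifying on 'day' keeps each day's share (almost) equal in both partitions.
train, test = train_test_split(meta, test_size=0.2, stratify=meta['day'], random_state=0)

# Filter both partitions to day 4 and recompute the proportions.
n_train = (train['day'] == 4).sum()
n_test = (test['day'] == 4).sum()
train_share = n_train / (n_train + n_test)
print(round(train_share, 2), round(1 - train_share, 2))
```

Without `stratify`, the day 4 share of each partition would vary from split to split, and the ratio after filtering could drift away from 80/20.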

In [8]:
Y_train['day'].unique()[0], Y_test['day'].unique()[0]

(4, 4)

The mask was applied correctly to the targets: only day 4 is present.

In [9]:
# Index.equals is True only when both indexes hold the same labels in the same order.
Y_train.index.equals(X_train.index)

True

In [10]:
# Same alignment check for the test partition.
Y_test.index.equals(X_test.index)

True

The mask was applied correctly to the predictors: each pair of indexes lines up.

## Consider the Correlations of the Genes to the Proteins

The correlations were calculated in a separate notebook with NVIDIA RAPIDS, using the same train-test split that is global to the project.

### Correlation Analysis

#### Drop missing values

In [11]:
corrs.rename(columns = {'Unnamed: 0': 'gene_id'}, inplace = True)

In [12]:
corrs.set_index('gene_id', inplace = True)

In [13]:
corrs.drop(columns = 'to_stratify', inplace = True)
# Created during train-test split, not relevant to modeling

In [14]:
corrs.head()

gene_id,CD86,CD274,CD270,CD155,CD112,CD47,CD48,CD40,CD154,CD52,...,CD94,CD162,CD85j,CD23,CD328,HLA-E,CD82,CD101,CD88,CD224
ENSG00000121410_A1BG,-2.2e-05,-0.00255,-0.004659,-0.000762,0.014661,0.014575,-0.000235,0.00049,-0.00629,0.009381,...,0.005153,-0.001545,-0.002458,-0.000402,0.002696,-0.00068,-0.010392,0.009006,-0.018318,-0.001764
ENSG00000268895_A1BG-AS1,0.001822,-0.011199,-0.000575,0.013445,0.014784,0.011216,0.00823,-0.00186,0.002584,0.0116,...,-0.003138,0.014779,-0.005772,0.001068,0.002792,0.005366,-0.000961,0.005244,-0.00358,-0.004096
ENSG00000175899_A2M,0.064626,0.00954,0.021473,0.013768,0.034001,0.009187,0.049535,0.015647,0.009757,0.044291,...,-0.004049,0.003167,0.010101,0.00798,0.150346,0.01581,-0.007537,0.143869,-0.004689,0.021982
ENSG00000245105_A2M-AS1,0.003193,0.011383,0.021819,0.045378,0.069043,0.017083,-0.013548,0.007841,0.015188,0.039427,...,0.010883,-0.001565,0.006488,0.017381,-0.009324,0.013009,0.007729,0.001289,-0.006739,0.035258
ENSG00000166535_A2ML1,0.003951,0.003135,-0.005503,-0.011076,-0.016184,-0.010638,-0.003587,0.000868,-0.004975,-0.007788,...,-0.001255,-0.008089,-0.00254,-0.00229,0.000386,-0.007121,0.001895,-0.002501,0.006183,-0.005017


In [15]:
corrs.isnull().sum().value_counts()

449    140
dtype: int64

There are 449 genes that are never recorded in any cell's transcriptome. Their expression is constant at zero, so their standard deviation is zero and the Pearson coefficient between each of these missing-in-action genes and any of the 140 proteins is undefined, which shows up as a null value. These null values can simply be dropped, since no signal can be extracted from them.
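Why a never-expressed gene produces a null correlation can be seen on a toy frame. This is a small illustration with invented column names and values, not data from the project:

```python
import numpy as np
import pandas as pd

# Toy data: 'silent_gene' is never expressed, so every cell has a value of zero.
toy = pd.DataFrame({
    'expressed_gene': [0.0, 1.2, 0.4, 2.1, 0.9],
    'silent_gene': [0.0, 0.0, 0.0, 0.0, 0.0],
    'protein': [0.3, 1.5, 0.2, 2.4, 1.1],
})

# Pearson's r divides the covariance by the product of the standard deviations.
# A constant column has zero standard deviation, so the ratio is undefined (NaN).
toy_corrs = toy.corr()['protein']
print(toy_corrs)
```

The `silent_gene` entry comes back as `NaN`, while `expressed_gene` gets an ordinary coefficient.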

In [16]:
corrs.shape[0] - corrs.dropna().shape[0]

449

Dropping these null values does indeed drop only the 449 genes that are missing in action.

In [17]:
corrs.dropna(inplace = True)

### Select relevant initial Genes

For every protein, I am picking the predictive columns whose absolute correlation with that protein is highest. These correlations were calculated in a separate notebook.

In [18]:
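number_of_genes_to_select = 100
abs_corrs = corrs.abs()  # Computed once instead of once per protein
selected_genes = []
# Y_train has 144 columns but only 140 proteins; [4:] skips the four leading non-protein columns.
for protein in Y_train.columns[4:]:
    array = abs_corrs[protein].sort_values(ascending = False).iloc[0:number_of_genes_to_select].index.values
    selected_genes.append(array)
# The '.values' lets us grab the gene names as an array instead of as part of Pandas' index class.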

In [19]:
len(selected_genes[0])

100

In [20]:
len(selected_genes), sum([len(col_names) for col_names in selected_genes]) / len(selected_genes)

(140, 100.0)

I have a list of genes for each of the 140 protein targets, and each list has a length set by the `number_of_genes_to_select` variable.
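The selection rule is easier to see on a tiny correlation table. The sketch below uses invented gene and protein names and picks the top two genes per protein. It uses `Series.nlargest` as a compact alternative to sorting and slicing.

```python
import pandas as pd

# Toy gene-by-protein correlation table (values are invented).
toy_corrs = pd.DataFrame(
    {'protein_a': [0.05, -0.60, 0.30, 0.10],
     'protein_b': [-0.40, 0.02, 0.45, -0.01]},
    index=['gene_1', 'gene_2', 'gene_3', 'gene_4'],
)

k = 2
abs_corrs = toy_corrs.abs()

# For each protein, keep the k genes with the largest absolute correlation.
top_genes = {p: abs_corrs[p].nlargest(k).index.tolist() for p in toy_corrs.columns}
print(top_genes)
# {'protein_a': ['gene_2', 'gene_3'], 'protein_b': ['gene_3', 'gene_1']}
```

Taking the absolute value first means a strongly negative correlation, such as `gene_2` with `protein_a`, is kept rather than ranked last.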

In [21]:
X_train.shape, Y_train.shape

((22516, 22050), (22516, 144))

### Consider Predictor Correlations

Linear regression does not strictly require uncorrelated predictors, but strongly correlated predictors (multicollinearity) make the fitted coefficients unstable and hard to interpret. So it is worth checking how correlated the genes, the predictive variables, are with one another.

In [None]:
gene_corrs = X_train.corr()

In [None]:
# Each column only has one case where the correlation is exactly one: the gene with itself.
# So I can safely declare all such values as NaN.
gene_corrs.mask(gene_corrs == 1.0).isnull().sum().unique()

In [None]:
# DataFrame.mask replaces every entry where the condition holds with NaN.
gene_corrs = gene_corrs.mask(gene_corrs == 1.0)

In [None]:
gene_corrs.max().describe()

PCA is useful here because the principal components are uncorrelated with one another, which removes the multicollinearity among the predictors before the linear regression is fit.
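A small synthetic check illustrates both halves of this idea. The data below is invented: three predictors that share a common factor. The sketch shows that the principal components are uncorrelated, and also that keeping every component only rotates the predictors, so an ordinary least squares fit on all components should give the same predictions as a fit on the raw predictors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2022)
n = 500
base = rng.normal(size=(n, 1))
# Three predictors sharing a common factor, so they are strongly correlated.
X = np.hstack([base + 0.3 * rng.normal(size=(n, 1)) for _ in range(3)])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Largest off-diagonal correlation before and after scaling + PCA.
components = PCA().fit_transform(StandardScaler().fit_transform(X))
max_offdiag_before = np.abs(np.corrcoef(X, rowvar=False) - np.eye(3)).max()
max_offdiag_after = np.abs(np.corrcoef(components, rowvar=False) - np.eye(3)).max()
print(max_offdiag_before, max_offdiag_after)

# With all components kept, the pipeline is an invertible transform of X,
# so the least squares predictions match those of the plain regression.
plain = LinearRegression().fit(X, y)
rotated = Pipeline([
    ('ss', StandardScaler()),
    ('pca', PCA()),
    ('lr', LinearRegression()),
]).fit(X, y)
same_predictions = np.allclose(plain.predict(X), rotated.predict(X))
print(same_predictions)
```

The practical consequence is that PCA changes the regression's predictions only when components are discarded, for example by setting `n_components`. With all components kept, its benefit is limited to coefficient stability and interpretability.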

## Fit Models

### Dummy Models

In [22]:
dumb = DummyRegressor()
dumb_output = lb.fit_and_evaluate_citeseq_models(dumb, selected_genes, X_train, X_test, Y_train, Y_test, figures = False, pca_viz = False, eval_lr_coefs = False)
dumb_output[0].describe()

0.0% complete
10.0% complete
20.0% complete
30.0% complete
40.0% complete
50.0% complete
60.0% complete
70.0% complete
80.0% complete
90.0% complete


Unnamed: 0,Train R-Squared,Train Mean-Squared Error,Train Mean-Absolute Error,Test R-Squared,Test Mean-Squared Error,Test Mean-Absolute Error
count,140.0,140.0,140.0,140.0,140.0,140.0
mean,0.0,5.039047,1.334432,0.0,5.048703,1.340146
std,0.0,9.776774,1.047765,0.0,9.781611,1.052554
min,0.0,0.488356,0.539358,-0.0,0.491506,0.541832
25%,0.0,0.812162,0.709666,-0.0,0.806175,0.707956
50%,0.0,1.207201,0.828066,-0.0,1.164899,0.834159
75%,0.0,5.170775,1.735204,-0.0,5.13386,1.765947
max,0.0,79.279221,5.87496,-0.0,77.830307,5.856011


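These are my baseline models for all 140 protein targets. Notice that every model has an R-squared of zero. This is by construction: `DummyRegressor` predicts the training mean of each protein for every cell, and R-squared measures improvement over exactly that constant-mean prediction. The error columns are therefore the numbers the later models need to beat.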
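The zero R-squared of a mean-predicting baseline can be reproduced on synthetic targets. This is a minimal sketch with invented data. The predictors are placeholders because `DummyRegressor` ignores them.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_train = rng.normal(loc=3.0, scale=2.0, size=800)
y_test = rng.normal(loc=3.0, scale=2.0, size=200)
# Placeholder predictors; the dummy model never looks at them.
X_train = np.zeros((800, 1))
X_test = np.zeros((200, 1))

# The default strategy predicts the mean of the training targets.
dummy = DummyRegressor().fit(X_train, y_train)
train_r2 = r2_score(y_train, dummy.predict(X_train))
test_r2 = r2_score(y_test, dummy.predict(X_test))
print(train_r2, test_r2)
```

On the training data the score is zero up to floating point error. On the test data it is zero or slightly negative, because the training mean is not exactly the test mean, which is consistent with the `-0.0` entries in the test column above.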

### Linear Regression

In [23]:
lr = LinearRegression()
lr_output = lb.fit_and_evaluate_citeseq_models(lr, selected_genes, X_train, X_test, Y_train, Y_test, figures = False, pca_viz = False, eval_lr_coefs = False)
lr_output[0].describe()

0.0% complete
10.0% complete
20.0% complete
30.0% complete
40.0% complete
50.0% complete
60.0% complete
70.0% complete
80.0% complete
90.0% complete


Unnamed: 0,Train R-Squared,Train Mean-Squared Error,Train Mean-Absolute Error,Test R-Squared,Test Mean-Squared Error,Test Mean-Absolute Error
count,140.0,140.0,140.0,140.0,140.0,140.0
mean,0.233429,2.632121,1.05507,0.223429,2.662942,1.063131
std,0.195631,3.796078,0.607022,0.20043,3.851,0.612793
min,0.01,0.447743,0.515696,0.0,0.451148,0.521736
25%,0.07,0.733318,0.67605,0.05,0.744541,0.685093
50%,0.165,1.056432,0.785525,0.16,1.051766,0.794447
75%,0.4,2.692111,1.218333,0.3725,2.684913,1.221707
max,0.81,26.162706,3.518552,0.81,25.959379,3.566404


In [24]:
lr_output[0].sort_values(by = 'Test R-Squared', ascending = False).head()

Unnamed: 0,Train R-Squared,Train Mean-Squared Error,Train Mean-Absolute Error,Test R-Squared,Test Mean-Squared Error,Test Mean-Absolute Error
CD41,0.81,7.321157,1.801733,0.81,7.542717,1.829246
CD32,0.68,9.217151,2.151177,0.69,8.731278,2.13523
CD36,0.67,26.162706,3.098529,0.67,25.959379,3.118734
CD71,0.65,4.09552,1.478676,0.65,4.006034,1.500753
CD48,0.63,10.466292,2.370926,0.62,11.460934,2.448249


### Linear Regression with PCA

#### Linear Regression with all of the PCA components

In [25]:
lr_with_pca = Pipeline([
                    ('ss', StandardScaler()),
                    ('pca', PCA(random_state = 2022)),
                    ('lr', LinearRegression())
                ])
lr_with_pca_output = lb.fit_and_evaluate_citeseq_models(lr_with_pca, selected_genes, X_train, X_test, Y_train, Y_test, figures = False, pca_viz = False, eval_lr_coefs = False)
lr_with_pca_output[0].describe()

0.0% complete
10.0% complete
20.0% complete
30.0% complete
40.0% complete
50.0% complete
60.0% complete
70.0% complete
80.0% complete
90.0% complete


Unnamed: 0,Train R-Squared,Train Mean-Squared Error,Train Mean-Absolute Error,Test R-Squared,Test Mean-Squared Error,Test Mean-Absolute Error
count,140.0,140.0,140.0,140.0,140.0,140.0
mean,0.233429,2.632122,1.055062,0.223429,2.662928,1.063122
std,0.195631,3.796078,0.607025,0.20043,3.851006,0.612798
min,0.01,0.447743,0.515696,0.0,0.451148,0.521736
25%,0.07,0.733318,0.67605,0.05,0.744543,0.685093
50%,0.165,1.056709,0.785514,0.16,1.051418,0.794447
75%,0.4,2.692111,1.218333,0.3725,2.684913,1.221707
max,0.81,26.162706,3.518552,0.81,25.959379,3.566404


In [26]:
lr_with_pca_output[0].sort_values(by = 'Test R-Squared', ascending = False).head()

Unnamed: 0,Train R-Squared,Train Mean-Squared Error,Train Mean-Absolute Error,Test R-Squared,Test Mean-Squared Error,Test Mean-Absolute Error
CD41,0.81,7.321156,1.801733,0.81,7.542717,1.829246
CD32,0.68,9.217153,2.151177,0.69,8.73128,2.13523
CD36,0.67,26.162706,3.098529,0.67,25.959379,3.118735
CD71,0.65,4.09552,1.478676,0.65,4.006034,1.500753
CD48,0.63,10.466292,2.370926,0.62,11.460931,2.448249


#### Linear Regression with PCA used for additional dimension reduction

In [27]:
lr_with_pca_dim_reduce = Pipeline([
                    ('ss', StandardScaler()),
                    ('pca', PCA(n_components = 10, random_state = 2022)),
                    ('lr', LinearRegression())
                ])

lr_with_pca_dim_reduce_output = lb.fit_and_evaluate_citeseq_models(lr_with_pca_dim_reduce, selected_genes,
                                                                   X_train, X_test, Y_train, Y_test,
                                                                   figures = False, pca_viz = False, eval_lr_coefs = False)
lr_with_pca_dim_reduce_output[0].describe()

0.0% complete
10.0% complete
20.0% complete
30.0% complete
40.0% complete
50.0% complete
60.0% complete
70.0% complete
80.0% complete
90.0% complete


Unnamed: 0,Train R-Squared,Train Mean-Squared Error,Train Mean-Absolute Error,Test R-Squared,Test Mean-Squared Error,Test Mean-Absolute Error
count,140.0,140.0,140.0,140.0,140.0,140.0
mean,0.213643,2.820696,1.077103,0.213143,2.825226,1.080287
std,0.186216,4.303564,0.6421,0.18805,4.318351,0.645992
min,0.0,0.455171,0.520389,0.0,0.454496,0.5236
25%,0.05,0.743746,0.6801,0.0475,0.742162,0.684699
50%,0.155,1.069571,0.788502,0.16,1.045398,0.793383
75%,0.3625,2.77466,1.243681,0.3525,2.798484,1.2432
max,0.8,32.067917,3.631608,0.8,31.332642,3.681198


In [28]:
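# This cell displays lr_output, the plain linear regression results, not lr_with_pca_dim_reduce_output.
lr_output[0].sort_values(by = 'Test R-Squared', ascending = False).head()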

Unnamed: 0,Train R-Squared,Train Mean-Squared Error,Train Mean-Absolute Error,Test R-Squared,Test Mean-Squared Error,Test Mean-Absolute Error
CD41,0.81,7.321157,1.801733,0.81,7.542717,1.829246
CD32,0.68,9.217151,2.151177,0.69,8.731278,2.13523
CD36,0.67,26.162706,3.098529,0.67,25.959379,3.118734
CD71,0.65,4.09552,1.478676,0.65,4.006034,1.500753
CD48,0.63,10.466292,2.370926,0.62,11.460934,2.448249


## Evaluate Models

### Compare to Baseline

### Check Residuals

#### Linear Regression Models

#### Linear Regression Models with PCA

## Conclusions