<a href="https://colab.research.google.com/github/pachterlab/LSCHWCP_2023/blob/main/Notebooks/Figure_8/Figure_8bc/run_regressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic regressions to predict viral presence based on host gene expression

### NOTE: Running this notebook requires a large amount of memory (~32 GB of RAM), exceeding the memory limits of the free Google Colab version. Change the runtime to a higher memory type (e.g. "A100 GPU") to run this notebook on Google Colab.

In [1]:
!pip install -q anndata


### Download data
Download the count matrices for the virus data (generated [here](https://github.com/pachterlab/LSCHWCP_2023/tree/main/Notebooks/align_macaque_PBMC_data/1_virus_no_mask)) and the macaque data (generated [here](https://github.com/pachterlab/LSCHWCP_2023/tree/main/Notebooks/Supp_Fig_3/Supp_Fig_3abc)):

In [2]:
# Download data from Caltech Data
# (-O writes directly to the final file name; quoting the URL keeps the shell from interpreting the "?")
!wget -O virus_no_mask.h5ad "https://data.caltech.edu/records/sh33z-hrx98/files/virus_no_mask.h5ad?download=1"
!wget -O macaque_QC_norm_leiden_celltypes.h5ad "https://data.caltech.edu/records/sh33z-hrx98/files/macaque_QC_norm_leiden_celltypes.h5ad?download=1"

--2024-05-04 23:02:59--  https://data.caltech.edu/records/sh33z-hrx98/files/virus_no_mask.h5ad?download=1
Resolving data.caltech.edu (data.caltech.edu)... 35.155.11.48
Connecting to data.caltech.edu (data.caltech.edu)|35.155.11.48|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://s3.us-west-2.amazonaws.com/caltechdata/32/a5/1c1a-bb66-4f66-a133-60763da8d716/data?response-content-type=application%2Foctet-stream&response-content-disposition=attachment%3B%20filename%3Dvirus_no_mask.h5ad&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARCVIVNNAP7NNDVEA%2F20240504%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20240504T230259Z&X-Amz-Expires=60&X-Amz-SignedHeaders=host&X-Amz-Signature=0c7d3cc0a8e2e3c3231131b30a020e24c1a241341bc6740de1166e571684cfad [following]
--2024-05-04 23:02:59--  https://s3.us-west-2.amazonaws.com/caltechdata/32/a5/1c1a-bb66-4f66-a133-60763da8d716/data?response-content-type=application%2Foctet-stream&response-content-disposition=atta

Download code to run logistic regressions:

In [3]:
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/Figure_8/Figure_8bc/logisticRegression.py

--2024-05-04 23:04:27--  https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/Figure_8/Figure_8bc/logisticRegression.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14404 (14K) [text/plain]
Saving to: ‘logisticRegression.py’


2024-05-04 23:04:27 (21.5 MB/s) - ‘logisticRegression.py’ saved [14404/14404]



### Build models

For all models, negative training cells are selected such that they are of the same cell types as the positive training cells. All models are trained and tested using only the top 50% of cells by sequencing depth, which reduces the occurrence of false viral absence labels. All models are generated for all 'macaque only' and 'shared' viruses.
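
The cell selection described above can be illustrated with a small, self-contained sketch. This is not the code in `logisticRegression.py`; it is one plausible reading of the description, assuming that negatives are drawn per cell type in the same numbers as the positives (consistent with the `equalprop` control used in this notebook). The function name `select_training_cells` and the dictionary keys `depth`, `cell_type`, and `virus_present` are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter


def select_training_cells(cells, seed=0):
    """Return (positives, negatives) from a list of cell records.

    Each record is a dict with the hypothetical keys 'depth',
    'cell_type' and 'virus_present'.
    """
    rng = random.Random(seed)

    # Keep only the top 50% of cells by sequencing depth
    ranked = sorted(cells, key=lambda c: c["depth"], reverse=True)
    deep = ranked[: len(ranked) // 2]

    positives = [c for c in deep if c["virus_present"]]

    # Group the virus-negative cells by cell type
    negatives_by_type = {}
    for c in deep:
        if not c["virus_present"]:
            negatives_by_type.setdefault(c["cell_type"], []).append(c)

    # For each cell type seen among the positives, draw the same number
    # of negatives (or as many as are available)
    negatives = []
    type_counts = Counter(c["cell_type"] for c in positives)
    for cell_type, n_pos in type_counts.items():
        pool = negatives_by_type.get(cell_type, [])
        negatives.extend(rng.sample(pool, min(n_pos, len(pool))))

    return positives, negatives


# Deterministic toy data: 1000 cells, 3 cell types, every 5th cell positive
toy_cells = [
    {
        "depth": i,
        "cell_type": ["B", "T", "NK"][i % 3],
        "virus_present": i % 5 == 0,
    }
    for i in range(1000)
]

positives, negatives = select_training_cells(toy_cells, seed=42)
print(len(positives), len(negatives))
```

In this toy example the positive and negative sets have the same size and the same cell type composition, which mirrors the pattern in the script's log output, where the label sum is half the length of `X` for each virus.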

In [4]:
# Define random seeds (based on https://www.kaggle.com/code/residentmario/kernel16e284dcb7)
random_seeds = [0, 1, 10, 42, 100, 1234]
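
The next cell runs six model configurations for each of these seeds by repeating the full command each time. As a possible alternative, the same commands could be generated from a small table of settings. The sketch below only builds and prints the command lines; the flags and values are copied from this notebook, while the helper name `build_command` and the `configurations` table are introduced here for illustration.

```python
import shlex

random_seeds = [0, 1, 10, 42, 100, 1234]

# (genes_kind, covariates_kind, scramble) for the six runs per seed
configurations = [
    ("hv", "donor_time", False),
    ("hv", "none", False),
    ("all", "donor_time", False),
    ("all", "none", False),
    ("all", "donor_time", True),   # negative control with scrambled labels
    ("hv", "donor_time", True),    # negative control with scrambled labels
]


def build_command(seed, genes_kind, covariates_kind, scramble):
    """Assemble the argument list for one logisticRegression.py run."""
    args = [
        "python3", "logisticRegression.py",
        "--covariates_kind", covariates_kind,
        "--genes_kind", genes_kind,
        "--regularization", "l2",
        "--viruses_kind", "supp",
        "--control", "equalprop",
        "--matrix", "halfM",
        "--random_seed", str(seed),
    ]
    if scramble:
        args += ["--scramble", "True"]
    return args


commands = [
    build_command(seed, *config)
    for seed in random_seeds
    for config in configurations
]

# Print the first two commands for inspection
for cmd in commands[:2]:
    print(shlex.join(cmd))
```

Each entry in `commands` could then be passed to `subprocess.run(cmd, check=True)`. Keeping the settings in one table makes it harder for the six blocks to drift apart when a flag changes.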

In [None]:
%%time
for seed in random_seeds:
    # Run the regression on highly variable (hv) macaque genes with covariates (time point and donor animal)
    !python3 logisticRegression.py \
        --covariates_kind "donor_time" \
        --genes_kind "hv" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed

    # Run the regression on highly variable macaque genes without covariates
    !python3 logisticRegression.py \
        --covariates_kind "none" \
        --genes_kind "hv" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed

    # Run the regression on all macaque genes with covariates (time point and donor animal)
    !python3 logisticRegression.py \
        --covariates_kind "donor_time" \
        --genes_kind "all" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed

    # Run the regression on all macaque genes without covariates
    !python3 logisticRegression.py \
        --covariates_kind "none" \
        --genes_kind "all" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed

    # Run the regression on all macaque genes with covariates (time point and donor animal)
    # but scramble the virus presence/absence labels as a negative control
    !python3 logisticRegression.py \
        --covariates_kind "donor_time" \
        --genes_kind "all" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed \
        --scramble True

    # Run the regression on highly variable macaque genes with covariates (time point and donor animal)
    # but scramble the virus presence/absence labels as a negative control
    !python3 logisticRegression.py \
        --covariates_kind "donor_time" \
        --genes_kind "hv" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed \
        --scramble True

# Zip the models into a compressed file
!zip models.zip *.pickle

USING EQUAL PROPORTIONS
Using all 'macaque only' and 'shared' viruses
u10supp
232 length of X, and sum:, 116.0
u1001supp
282 length of X, and sum:, 141.0
u10015supp
324 length of X, and sum:, 162.0
u10240supp
94 length of X, and sum:, 47.0
u11150supp
144 length of X, and sum:, 72.0
u27694supp
48396 length of X, and sum:, 24198.0
u34159supp
7916 length of X, and sum:, 3958.0
u39566supp
13864 length of X, and sum:, 6932.0
u100000supp
268 length of X, and sum:, 134.0
u100001supp
936 length of X, and sum:, 468.0
u100002supp
8630 length of X, and sum:, 4315.0
u100004supp
1238 length of X, and sum:, 619.0
u100007supp
816 length of X, and sum:, 408.0
u100011supp
1866 length of X, and sum:, 933.0
u100012supp
10352 length of X, and sum:, 5176.0
u100017supp
9276 length of X, and sum:, 4638.0
u100019supp
462 length of X, and sum:, 231.0
u100024supp
1062 length of X, and sum:, 531.0
u100026supp
1002 length of X, and sum:, 501.0
u100028supp
5378 length of X, and sum:, 2689.0
u100031supp
26978 lengt