<a href="https://colab.research.google.com/github/pachterlab/LSCHWCP_2023/blob/main/Notebooks/Figure_8/Figure_8bc/run_regressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic regressions to predict virus presence based on host gene expression

### NOTE: Running this notebook requires a large amount of memory (32 GB of RAM), which exceeds Google Colab memory limits.

In [None]:
!pip install -q anndata

### Download data
Download the count matrices for the virus data (generated [here](https://github.com/pachterlab/LSCHWCP_2023/tree/main/Notebooks/align_macaque_PBMC_data/1_virus_no_mask)) and the macaque data (generated [here](https://github.com/pachterlab/LSCHWCP_2023/tree/main/Notebooks/Supp_Fig_3/Supp_Fig_3abc)):

In [None]:
# Download data from Caltech Data
# (URLs are quoted so the shell does not treat "?" as a wildcard; -O sets the output file name)
!wget -O virus_no_mask.h5ad "https://data.caltech.edu/records/sh33z-hrx98/files/virus_no_mask.h5ad?download=1"
!wget -O macaque_QC_norm_leiden_celltypes.h5ad "https://data.caltech.edu/records/sh33z-hrx98/files/macaque_QC_norm_leiden_celltypes.h5ad?download=1"

Download code to run logistic regressions:

In [None]:
!wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/Notebooks/Figure_8/Figure_8bc/logisticRegression.py

### Build models

For all models, negative training cells are selected such that they are of the same cell types as the positive training cells. All models are trained and tested using only the top 50% of cells by sequencing depth, to reduce the occurrence of false viral absence labels. Models are generated for all 'macaque only' and 'shared' viruses.
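
The cell selection described above can be illustrated with a minimal sketch. This is not the implementation in `logisticRegression.py`: the record keys (`depth`, `cell_type`, `virus_positive`), the function name `select_training_cells`, and the choice to draw one negative per positive within each cell type are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def select_training_cells(cells, seed=0):
    """Pick positives and cell-type-matched negatives from the deepest half of cells.

    `cells` is a list of dicts with the hypothetical keys
    'depth', 'cell_type' and 'virus_positive'.
    """
    rng = random.Random(seed)

    # Keep only the top 50% of cells ranked by sequencing depth
    ranked = sorted(cells, key=lambda c: c["depth"], reverse=True)
    deep = ranked[: len(ranked) // 2]

    positives = [c for c in deep if c["virus_positive"]]

    # Group the virus-negative cells by cell type
    negatives_by_type = defaultdict(list)
    for c in deep:
        if not c["virus_positive"]:
            negatives_by_type[c["cell_type"]].append(c)

    # For each cell type seen among the positives, sample up to as many
    # negatives of that type as there are positives (assumed matching rule)
    negatives = []
    for cell_type, n_pos in Counter(c["cell_type"] for c in positives).items():
        pool = negatives_by_type[cell_type]
        negatives.extend(rng.sample(pool, min(n_pos, len(pool))))

    return positives, negatives


# Toy data: 40 cells with decreasing depth, two cell types, every fifth cell positive
cells = [
    {"depth": 1000 - i, "cell_type": "B" if i % 2 else "T", "virus_positive": i % 5 == 0}
    for i in range(40)
]
pos, neg = select_training_cells(cells, seed=42)
print(Counter(c["cell_type"] for c in pos))
print(Counter(c["cell_type"] for c in neg))
```

Restricting negatives to the cell types of the positives keeps the model from simply learning cell type as a proxy for virus presence.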

In [None]:
# Define random seeds (based on https://www.kaggle.com/code/residentmario/kernel16e284dcb7)
random_seeds = [0, 1, 10, 42, 100, 1234]

In [None]:
%%time
for seed in random_seeds:
    # Run the regression on highly variable (hv) macaque genes with covariates (time point and donor animal)
    !python3 logisticRegression.py \
        --covariates_kind "donor_time" \
        --genes_kind "hv" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed

    # Run the regression on highly variable macaque genes without covariates
    !python3 logisticRegression.py \
        --covariates_kind "none" \
        --genes_kind "hv" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed

    # Run the regression on all macaque genes with covariates (time point and donor animal)
    !python3 logisticRegression.py \
        --covariates_kind "donor_time" \
        --genes_kind "all" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed

    # Run the regression on all macaque genes without covariates
    !python3 logisticRegression.py \
        --covariates_kind "none" \
        --genes_kind "all" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed

    # Run the regression on all macaque genes with covariates (time point and donor animal)
    # but scramble the virus presence/absence labels
    !python3 logisticRegression.py \
        --covariates_kind "donor_time" \
        --genes_kind "all" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed \
        --scramble True

    # Run the regression on highly variable macaque genes with covariates (time point and donor animal)
    # but scramble the virus presence/absence labels
    !python3 logisticRegression.py \
        --covariates_kind "donor_time" \
        --genes_kind "hv" \
        --regularization "l2" \
        --viruses_kind "supp" \
        --control "equalprop" \
        --matrix "halfM" \
        --random_seed $seed \
        --scramble True

# Zip the models into a compressed file
!zip models.zip *.pickle
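
The six runs per seed differ only in `--covariates_kind`, `--genes_kind`, and whether `--scramble True` is passed. As an optional alternative to the cell above, the same command lines could be generated from a small table. This is a sketch: the names `runs` and `build_command` are hypothetical helpers, and the flags are copied from the invocations above rather than taken from the script's documentation.

```python
random_seeds = [0, 1, 10, 42, 100, 1234]

# (covariates_kind, genes_kind, scramble) for the six runs, in the order used above
runs = [
    ("donor_time", "hv", False),
    ("none", "hv", False),
    ("donor_time", "all", False),
    ("none", "all", False),
    ("donor_time", "all", True),
    ("donor_time", "hv", True),
]


def build_command(covariates_kind, genes_kind, scramble, seed):
    """Return the argument list for one logisticRegression.py run."""
    cmd = [
        "python3", "logisticRegression.py",
        "--covariates_kind", covariates_kind,
        "--genes_kind", genes_kind,
        "--regularization", "l2",
        "--viruses_kind", "supp",
        "--control", "equalprop",
        "--matrix", "halfM",
        "--random_seed", str(seed),
    ]
    if scramble:
        # Only the scrambled-label controls pass this flag
        cmd += ["--scramble", "True"]
    return cmd


commands = [build_command(*run, seed) for seed in random_seeds for run in runs]
print(len(commands))  # 36 (6 seeds x 6 runs)
```

Each entry in `commands` can then be executed with `subprocess.run(cmd, check=True)`, which stops at the first failing run instead of continuing silently.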