<a href="https://colab.research.google.com/github/pachterlab/varseek-examples/blob/main/vk_count.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [vk count](https://github.com/pachterlab/varseek) demonstration
Perform variant screening on scRNA-seq data with vk count, using a 10x PBMC 1k dataset as an example.

Written by Joseph Rich.
___

### Install varseek and import all packages

In [12]:
try:
    import varseek as vk
except ImportError:
    print("varseek not found, installing...")
    !pip install -U -q varseek

In [13]:
import os
import shutil

import varseek as vk

### Define important paths

In [None]:
vk_count_out_dir = os.path.join("data", "varseek_count_out")
vk_ref_out_dir = os.path.join("data", "varseek_ref_out")

vcrs_index = os.path.join(vk_ref_out_dir, "vcrs_index.idx")
vcrs_t2g = os.path.join(vk_ref_out_dir, "vcrs_t2g_filtered.txt")
fastqs_dir = os.path.join("data", "pbmc_1k_v3_fastqs")
technology = "10xv3"
kb_count_reference_genome_dir = os.path.join("data", "kb_count_reference_genome")

### Run the notebook vk_ref.ipynb to generate the reference index and t2g files. Alternatively, download them from Box if they do not already exist.

In [None]:
if not os.path.exists(vcrs_index):
    vcrs_index_url = "https://caltech.box.com/shared/static/8693b78lh02fv8qh6wz6keng7cn2n91k.idx"
    vk.utils.download_box_url(vcrs_index_url, output_folder=vk_ref_out_dir, output_file_name="vcrs_index.idx")

if not os.path.exists(vcrs_t2g):
    vcrs_t2g_url = "https://caltech.box.com/shared/static/0svv7xx0mobhfzpiz7f48bjljhs72kcy.txt"
    vk.utils.download_box_url(vcrs_t2g_url, output_folder=vk_ref_out_dir, output_file_name="vcrs_t2g_filtered.txt")

File downloaded successfully to data/varseek_ref_out2/vcrs_index.idx
File downloaded successfully to data/varseek_ref_out2/vcrs_t2g_filtered.txt
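
The download-if-missing pattern used above can also be written with the standard library alone. The sketch below is a generic illustration, not a replacement for `vk.utils.download_box_url`: the helper name `download_if_missing` is hypothetical, and it assumes the URL can be fetched directly without any Box-specific handling.

```python
import os
import urllib.request


def download_if_missing(url, dest_path):
    """Fetch url into dest_path unless dest_path already exists.

    Hypothetical helper for illustration only. Returns True if a download
    happened and False if the existing file was reused.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    # Write to a temporary name first so an interrupted transfer
    # is not mistaken for a complete file on the next run.
    tmp_path = dest_path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, dest_path)
    return True
```

Writing to a `.part` file and renaming it at the end keeps the `os.path.exists` check meaningful after an interrupted download.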


### Download the PBMC fastq dataset

In [None]:
if not os.path.exists(fastqs_dir) or len(os.listdir(fastqs_dir)) == 0:
    !mkdir -p data && \
        cd data && \
        curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar && \
        tar -xvf pbmc_1k_v3_fastqs.tar && \
        rm pbmc_1k_v3_fastqs.tar


### (Recommended) Preprocess the FASTQ data, using [fastp](https://github.com/OpenGene/fastp) as an example

In [None]:
if shutil.which("fastp"):
    fastqs_filtered_dir = os.path.join(fastqs_dir, "filtered")
    fastp_report_dir = os.path.join(fastqs_dir, "fastp_reports")
    # make sure the output directories exist before fastp writes into them
    os.makedirs(fastqs_filtered_dir, exist_ok=True)
    os.makedirs(fastp_report_dir, exist_ok=True)

    fastq_pairs = [
        ("pbmc_1k_v3_S1_L001_R1_001.fastq.gz", "pbmc_1k_v3_S1_L001_R2_001.fastq.gz"),
        ("pbmc_1k_v3_S1_L002_R1_001.fastq.gz", "pbmc_1k_v3_S1_L002_R2_001.fastq.gz"),
    ]
    for file_r1, file_r2 in fastq_pairs:
        file_r1_path = os.path.join(fastqs_dir, file_r1)
        file_r2_path = os.path.join(fastqs_dir, file_r2)

        file_r1_out_path = os.path.join(fastqs_filtered_dir, file_r1)
        file_r2_out_path = os.path.join(fastqs_filtered_dir, file_r2)

        # one report per lane, so the second run does not overwrite the first
        report_prefix = os.path.join(fastp_report_dir, file_r1.replace("_R1_001.fastq.gz", ""))

        print(f"Processing {file_r1} and {file_r2} with fastp")
        !fastp -i {file_r1_path} -I {file_r2_path} -o {file_r1_out_path} -O {file_r2_out_path} --cut_front --cut_tail --cut_window_size 4 --cut_mean_quality 20 -h {report_prefix}.html -j {report_prefix}.json --qualified_quality_phred 15 --unqualified_percent_limit 40 --n_base_limit 5 --length_required 31

    # use the filtered reads for all downstream steps
    fastqs_dir = fastqs_filtered_dir
else:
    print("fastp is not installed. Skipping fastq pre-processing")
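
The R1/R2 file names in the cell above are hard-coded for this dataset. For other datasets, the pairs could be discovered from the directory listing instead. The sketch below assumes Illumina-style names of the form `<prefix>_R1_<NNN>.fastq.gz` with a matching `_R2_` mate. The helper name `find_fastq_pairs` is hypothetical and not part of varseek.

```python
import os
import re


def find_fastq_pairs(directory):
    """Return sorted (R1, R2) file-name pairs found in directory.

    Hypothetical helper. Files without a mate, and files that do not
    follow the *_R1_NNN.fastq.gz pattern (index reads, reports), are skipped.
    """
    pattern = re.compile(r"^(?P<prefix>.+)_R1_(?P<suffix>\d+\.fastq\.gz)$")
    names = set(os.listdir(directory))
    pairs = []
    for name in sorted(names):
        match = pattern.match(name)
        if not match:
            continue
        mate = f"{match['prefix']}_R2_{match['suffix']}"
        if mate in names:
            pairs.append((name, mate))
    return pairs
```

The returned list has the same shape as the hard-coded list of tuples, so it could be iterated over in the same `for file_r1, file_r2 in ...` loop.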

### (Recommended) Pseudoalign the FASTQ data to the reference genome; this helps with adata processing in varseek clean
varseek count can run this step internally when given the reference_genome_index and reference_genome_t2g arguments, but it is convenient to run it separately beforehand.

In [None]:
reference_genome_index = os.path.join(kb_count_reference_genome_dir, "index.idx")  # either already exists or will be created
reference_genome_t2g = os.path.join(kb_count_reference_genome_dir, "t2g.txt")  # either already exists or will be created

species = "human"  # if reference_genome_index/reference_genome_t2g do not exist, supply either (1) species or (2) a reference genome FASTA and GTF
reference_genome_fasta = ""
reference_genome_gtf = ""

# check if kb count was run
if not os.path.exists(kb_count_reference_genome_dir) or len(os.listdir(kb_count_reference_genome_dir)) == 0:
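    os.makedirs(kb_count_reference_genome_dir, exist_ok=True)  # make sure the output directory exists before kb ref writes into it
    # check if kb ref was run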
    if not os.path.exists(reference_genome_index) or not os.path.exists(reference_genome_t2g):
        if species:
            !kb ref -i {reference_genome_index} -g {reference_genome_t2g} -d {species}
        else:
            reference_genome_f1 = os.path.join(kb_count_reference_genome_dir, "f1.fa")
            !kb ref -t 2 -i {reference_genome_index} -g {reference_genome_t2g} -f1 {reference_genome_f1} {reference_genome_fasta} {reference_genome_gtf}
            !rm {reference_genome_f1}
    
    !kb count -t 2 -i {reference_genome_index} -g {reference_genome_t2g} -x {technology} --h5ad --num -o {kb_count_reference_genome_dir} {fastqs_dir}/pbmc_1k_v3_S1_L001_R1_001.fastq.gz {fastqs_dir}/pbmc_1k_v3_S1_L001_R2_001.fastq.gz {fastqs_dir}/pbmc_1k_v3_S1_L002_R1_001.fastq.gz {fastqs_dir}/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
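
Outside a notebook, where the `!` shell magic is unavailable, the same `kb count` call could be assembled as an argument list and passed to `subprocess.run`. The sketch below only reuses the flags that appear in the cell above. The helper name `build_kb_count_command` is hypothetical.

```python
import shlex


def build_kb_count_command(index, t2g, technology, out_dir, fastq_paths, threads=2):
    """Assemble the kb count argument list used in this notebook.

    Hypothetical helper. It only builds the list and does not run kb.
    """
    command = [
        "kb", "count",
        "-t", str(threads),
        "-i", index,
        "-g", t2g,
        "-x", technology,
        "--h5ad", "--num",
        "-o", out_dir,
    ]
    command.extend(fastq_paths)  # FASTQ files go last, in R1/R2 order per lane
    return command


def format_command(command):
    """Render the argument list as a shell-quoted string for logging."""
    return shlex.join(command)
```

Running it would then look like `subprocess.run(build_kb_count_command(...), check=True)`. Passing a list avoids shell quoting problems when paths contain spaces.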

### Run varseek count

This will run the following commands:
- `varseek fastqpp`: Preprocess the fastq files. By default, does nothing.
- `kb count` (variant reference): Perform variant screening on fastq data utilizing kb count's pseudoalignment algorithm. Variant data is stored in an AnnData object as a cell/sample x variant matrix.
- `kb count` ("normal" reference genome) (optional): Perform pseudoalignment of fastq data to the reference genome. Only performed if utilized in the subsequent "varseek clean" step. By default, this will occur if the paths to the necessary files are not provided as input.
- `varseek clean`: Process the output of kb count. By default, this will threshold variant counts and ensure that there is agreement for each read between the gene of the variant to which the read aligned during the variant reference pseudoalignment and the gene to which the read aligned during the "normal" reference genome pseudoalignment.
- `varseek summarize`: Produces a text file summarizing some high-level insights from the variant screening process.
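
To make the two default `varseek clean` checks described above concrete, here is a toy illustration of gene agreement followed by count thresholding. It is a conceptual sketch on made-up read IDs and labels, not varseek's implementation. The function name `toy_clean`, the `min_count` parameter, and the dictionary inputs are all assumptions for illustration.

```python
from collections import Counter


def toy_clean(variant_hits, genome_hits, min_count=2):
    """Toy version of gene-agreement filtering plus count thresholding.

    variant_hits: read_id -> (variant, gene of that variant)
    genome_hits:  read_id -> gene assigned by the reference genome pseudoalignment
    Returns variant -> count, keeping only variants with at least min_count
    reads whose two gene assignments agree.
    """
    agreeing_variants = [
        variant
        for read_id, (variant, gene) in variant_hits.items()
        if genome_hits.get(read_id) == gene
    ]
    counts = Counter(agreeing_variants)
    return {variant: n for variant, n in counts.items() if n >= min_count}


# Made-up example data
variant_hits = {
    "r1": ("geneA_var1", "geneA"),
    "r2": ("geneA_var1", "geneA"),
    "r3": ("geneB_var1", "geneB"),
    "r4": ("geneA_var1", "geneA"),
}
genome_hits = {"r1": "geneA", "r2": "geneA", "r3": "geneB", "r4": "geneC"}

result = toy_clean(variant_hits, genome_hits)
# result == {"geneA_var1": 2}: r4 is dropped for gene disagreement,
# and geneB_var1 is dropped because only one read supports it.
```

Real data adds complications that this sketch ignores, such as reads that are unassigned or assigned to several genes.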

In [None]:
vk_count_output_dict = vk.count(
    fastqs_dir,
    index=vcrs_index,
    t2g=vcrs_t2g,
    technology=technology,
    out=vk_count_out_dir,
    kb_count_reference_genome_dir=kb_count_reference_genome_dir,
    qc_against_gene_matrix=False,  # TODO: remove this override once QC against the gene matrix is working
)

In [None]:
print(f"Find summarized results in {vk_count_output_dict['vk_summarize_output_dir']}")
print(f"Find the processed adata object for further analysis in {vk_count_output_dict['adata_path']}")
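
Before moving on to downstream analysis, it can help to confirm that the reported outputs exist on disk. The sketch below assumes that the values stored under `vk_summarize_output_dir` and `adata_path` are file-system paths, as the print statements above suggest. The helper name `missing_outputs` is hypothetical.

```python
import os


def missing_outputs(output_dict, keys=("vk_summarize_output_dir", "adata_path")):
    """Return the keys whose value is absent or does not exist on disk.

    Hypothetical helper for illustration. An empty list means every
    expected output was found.
    """
    return [
        key
        for key in keys
        if not output_dict.get(key) or not os.path.exists(output_dict[key])
    ]
```

Calling `missing_outputs(vk_count_output_dict)` after the run should return an empty list if both outputs were written.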