<a href="https://colab.research.google.com/github/pachterlab/varseek-examples/blob/main/vk_ref.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [vk ref](https://github.com/pachterlab/varseek) demonstration
Create a variant reference index file with `varseek ref`, using a subset of 150 variants from the COSMIC Cancer Mutation Census sample file as the source of variants. The reference sequences are downloaded with gget.

Written by Joseph Rich.
___

### Install varseek, and import all packages

In [1]:
try:
    import varseek as vk
except ImportError:
    print("varseek not found, installing...")
    !pip install -U -q varseek

In [2]:
import os

import gget
import varseek as vk

### Define important paths and parameters
#### See more details in `vk ref --help`
- vk_ref_out_dir: output directory for all vk ref files (including the index and t2g that will subsequently be passed into vk count)
- reference_dir: Directory containing reference files, or the directory into which any reference files needed during vk ref are downloaded. Within vk ref, this is only relevant when (a) `variants` or `sequences` is given as a non-file-path value (see `vk ref --list_supported_databases`), or (b) `dlist_reference_source` is not None. In this tutorial, we also use it as the download location for our variants and sequences files.
- dlist_reference_source: The reference genome and transcriptome source used during variant reference filtering. See the vk ref documentation for all supported values, and for alternatives if you want to use a reference that is not in the internally supported list.
- variants: File containing the variants. The most common formats are (1) a CSV file with one variant per row, whose columns are named by the arguments `seq_id_column` and `var_column`, or (2) a VCF file. See `vk ref --help` for all valid formats. For our example, we will use the COSMIC Cancer Mutation Census sample file. Learn more about COSMIC [here](https://cancer.sanger.ac.uk/cosmic).
- sequences: FASTA file containing the sequences that correspond to the annotations in the variants file. It must come from the same genome assembly and release that were used to annotate the variants. With the COSMIC Cancer Mutation Census, for instance, the transcript variants are annotated against the cDNA of the GRCh37 genome assembly, Ensembl release 93.

In [3]:
vk_ref_out_dir = os.path.join("data", "varseek_ref_out_sample")
reference_dir = os.path.join("data", "reference")
dlist_reference_source = "t2t"

variants = os.path.join(reference_dir, "cosmic", "cosmic_cmc_subset.csv")
sequences = os.path.join(reference_dir, "ensembl_grch37_release93", "Homo_sapiens.GRCh37.cdna.all.fa")

### Download the COSMIC Cancer Mutation Census sample file

In [4]:
if not os.path.exists(variants):
    cosmic_cmc_subset_url = "https://caltech.box.com/shared/static/444q8amwnyooduxbm1jvvdn6z694z49o.csv"
    vk.utils.download_box_url(cosmic_cmc_subset_url, output_folder=os.path.dirname(variants), output_file_name=os.path.basename(variants))

    # # for full cosmic:
    # gget.cosmic(
    #     None,
    #     grch_version=37,
    #     cosmic_version=101,
    #     out=reference_dir,
    #     cosmic_project="cancer_example",
    #     download_cosmic=True,
    #     gget_mutate=True,
    # )

### Print the first few lines of the COSMIC file
Note that our sequence IDs are in the column "seq_ID", and our variants are in the column "mutation". These names are passed to vk ref as `seq_id_column` and `var_column`.

In [5]:
!head -n 10 {variants}

seq_ID,mutation
ENST00000374213,c.184A>G
ENST00000374213,c.180A>G
ENST00000374222,c.1922G>C
ENST00000374222,c.2004T>C
ENST00000368716,c.413T>A
ENST00000368716,c.436G>A
ENST00000368716,c.368G>T
ENST00000368716,c.391C>A
ENST00000368716,c.383C>A
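
Each value in the "mutation" column shown above follows the pattern `c.<position><ref>><alt>` for a single-base substitution. As a minimal sketch (not part of varseek; the helper name `parse_substitution` is hypothetical, and it only handles simple substitutions like the ones printed above), such a string can be split into its parts with a regular expression:

```python
import re

# Matches simple cDNA substitutions such as "c.184A>G".
# Assumption: other variant types (deletions, insertions, ...) are out of scope here.
SUBSTITUTION_PATTERN = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def parse_substitution(mutation):
    """Return (position, ref_base, alt_base), or None if the string is not a simple substitution."""
    match = SUBSTITUTION_PATTERN.match(mutation)
    if match is None:
        return None
    position, ref_base, alt_base = match.groups()
    return int(position), ref_base, alt_base


print(parse_substitution("c.184A>G"))  # (184, 'A', 'G')
```

The position is 1-based and refers to the transcript named in the "seq_ID" column of the same row.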


### Download the reference cDNA (GRCh37, Ensembl release 93)

In [6]:
if not os.path.exists(sequences):
    sequences_dir = "." if not os.path.dirname(sequences) else os.path.dirname(sequences)
    !gget ref -w cdna -r 93 --out_dir {sequences_dir} -d human_grch37
    !gunzip {sequences}.gz
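
If the `gunzip` command is not available (for example, on some Windows setups), the decompression step can be done with the Python standard library instead. This is a minimal sketch, not something the notebook itself uses; the helper name `decompress_gzip` is hypothetical.

```python
import gzip
import shutil


def decompress_gzip(gz_path, out_path):
    """Stream-decompress gz_path into out_path without loading the whole file into memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

In this notebook it would be called as `decompress_gzip(f"{sequences}.gz", sequences)`. Unlike `gunzip`, this leaves the original `.gz` file in place.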

### Run varseek ref

This will run the following commands:
- `varseek build`: Create a variant-containing reference sequence (VCRS) fasta file, where each VCRS contains the variant and at most *w* base pairs flanking each side of the variant (optimized to ensure that no k-mer is shared between a VCRS and its non-variant counterpart in the reference genome). Also creates a t2g file for compatibility with varseek count.
- `varseek info`: Collect additional information about each VCRS to be used for filtering. By default, it collects information on k-mer alignment between each VCRS and the T2T reference genome + transcriptome, VCRS pseudoalignment to the T2T reference genome (also indicating shared k-mers), and VCRS sequence complexity (i.e., the number of unique nucleotide triplets divided by the sequence length).
- `varseek filter`: Filter the VCRS fasta file and t2g file. By default, this removes any VCRS that shares a k-mer with any part of the reference genome + transcriptome, any VCRS that pseudoaligns to the reference genome, and any VCRS with fewer than three unique triplets.
- `kb ref`: Create a kb index file from the filtered VCRS fasta file that can be used in varseek count.
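
The ideas behind the build, info, and filter steps above can be illustrated on a toy sequence. The following is only a sketch under stated assumptions, not varseek's implementation: all function names are hypothetical, only single-base substitutions are handled, and the toy uses w = k - 1 so that every k-mer in the VCRS overlaps the variant position.

```python
def build_vcrs(transcript, position, alt_base, w):
    """Substitute alt_base at the 1-based position and keep at most w bases on each side."""
    index = position - 1
    start = max(0, index - w)
    end = min(len(transcript), index + w + 1)
    return transcript[start:index] + alt_base + transcript[index + 1:end]


def kmers(sequence, k):
    """Return the set of all k-mers in a sequence (empty if the sequence is shorter than k)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shares_kmer(vcrs, reference, k):
    """True if any k-mer of the VCRS also occurs in the reference sequence."""
    return not kmers(vcrs, k).isdisjoint(kmers(reference, k))


def triplet_complexity(sequence):
    """Number of unique nucleotide triplets divided by the sequence length."""
    return len(kmers(sequence, 3)) / len(sequence)


# Toy data (made up for illustration): a G>A substitution at position 8.
transcript = "ACGTACGGTCATTGCA"
k = 4
vcrs = build_vcrs(transcript, position=8, alt_base="A", w=k - 1)

print(vcrs)                              # ACGATCA
print(shares_kmer(vcrs, transcript, k))  # False
print(len(kmers(vcrs, 3)))               # 5 unique triplets
```

In this toy, the VCRS would pass the default-style filters described above: it shares no k-mer with the reference sequence and has at least three unique triplets. A homopolymer such as `"AAAAAAA"` has a single unique triplet and would be removed.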

In [7]:
vk_ref_output_dict = vk.ref(
    variants=variants,
    sequences=sequences,
    seq_id_column="seq_ID",  # specific to COSMIC Cancer Mutation Census transcriptome annotations, as referenced above
    var_column="mutation",  # specific to COSMIC Cancer Mutation Census transcriptome annotations, as referenced above
    out=vk_ref_out_dir,
    reference_out_dir=reference_dir,
    dlist_reference_source=dlist_reference_source
)

21:53:44 - INFO - Using COSMIC email from COSMIC_EMAIL environment variable
21:53:44 - INFO - Using COSMIC password from COSMIC_PASSWORD environment variable
21:53:44 - INFO - Running vk build
21:53:44 - INFO - Using COSMIC email from COSMIC_EMAIL environment variable: jmrich@caltech.edu
21:53:44 - INFO - Using COSMIC password from COSMIC_PASSWORD environment variable
21:53:46 - INFO - Using the seq_id_column:var_column 'seq_ID:mutation' columns as the variant header column.
21:53:46 - INFO - Removing 0 duplications > k
21:53:46 - INFO - Removed 0 variant-containing reference sequences with length less than 51...
21:53:46 - INFO - Removed 0 variant-containing reference sequences containing more than 0 'N's...
21:53:46 - INFO - All variants correctly recorded
21:53:46 - INFO - Merging rows of identical VCRSs
21:53:46 - INFO - 
        Number of variants total: 150
        Number of variants merged: 0
        Number of unique variants: 150
        Number of VCRSs: 150
        
21:53:46 -

Swapping complete


22:41:17 - INFO - Bowtie2 build complete
22:41:17 - INFO - Running bowtie2 alignment
22:41:17 - ERROR - Error aligning to normal genome and building alignment_to_reference: Command '['bowtie2', '-a', '-f', '-p', '2', '--xeq', '--score-min', 'C,0,0', '--np', '1', '--n-ceil', 'C,0,0', '-F', '51,1', '-R', '1', '-N', '0', '-L', '31', '-i', 'C,1,0', '--no-1mm-upfront', '--no-unal', '--no-hd', '-x', 'data/reference/t2t/bowtie_index_genome/data/reference/t2t/bowtie_index_genome/index', '-U', 'data/varseek_ref_out_sample/vcrs.fa', '-S', 'data/varseek_ref_out_sample/bowtie_vcrs_kmers_to_genome/alignment.sam']' returned non-zero exit status 2.
22:41:17 - INFO - Calculating longest homopolymer
22:41:18 - INFO - Calculating triplet stats
22:41:18 - INFO - sorting variant metadata by VCRS id
22:41:18 - INFO - Saving variant metadata
22:41:18 - INFO - Saved variant metadata to data/varseek_ref_out_sample/variants_updated_vk_info.csv
22:41:18 - INFO - Columns: Index(['vcrs_id', 'vcrs_sequence', 'vcrs

In [9]:
print(f"Find index at {vk_ref_output_dict['index']}")
print(f"Find t2g at {vk_ref_output_dict['t2g']}")

Find index at /Users/joeyrich/Desktop/local/varseek-examples/data/varseek_ref_out_sample/vcrs_index.idx
Find t2g at /Users/joeyrich/Desktop/local/varseek-examples/data/varseek_ref_out_sample/vcrs_t2g_filtered.txt
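
Before passing these files to vk count, it can be worth confirming that they actually exist, especially if the log reported errors during an intermediate step. A minimal sketch, assuming only that the returned dictionary has the `index` and `t2g` keys used above; the helper name `missing_outputs` is hypothetical:

```python
import os


def missing_outputs(output_dict, keys=("index", "t2g")):
    """Return the keys whose path is absent from the dict or does not exist on disk."""
    missing = []
    for key in keys:
        path = output_dict.get(key)
        if not path or not os.path.isfile(path):
            missing.append(key)
    return missing
```

In this notebook, `missing_outputs(vk_ref_output_dict)` returns an empty list when both files are present.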
