<a href="https://colab.research.google.com/github/pachterlab/varseek-examples/blob/main/vk_ref.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [vk ref](https://github.com/pachterlab/varseek) demonstration
This notebook creates a variant reference index file with `varseek ref`, using the COSMIC Cancer Mutation Census sample file as the source of variants. gget is used to download both the reference sequences and the COSMIC variants.

Written by Joseph Rich.
___

### Install varseek and import all packages

In [1]:
try:
    import varseek as vk
except ImportError:
    print("varseek not found, installing...")
    !pip install -U -q varseek

In [2]:
import os

import gget
import varseek as vk

### Define important paths

In [3]:
vk_ref_out_dir = os.path.join("data", "varseek_ref_out")
reference_dir = os.path.join("data", "reference")

cosmic_file_path = os.path.join(reference_dir, "example_GRCh37", "CancerMutationCensus_AllData_v101_GRCh37_mutation_workflow.csv")
cds_file_path = os.path.join(reference_dir, "Homo_sapiens.GRCh37.cds.all.fa")

### Download the COSMIC Cancer Mutation Census sample file

In [15]:
gget.cosmic(
    None,
    grch_version=37,
    cosmic_version=101,
    out=reference_dir,
    mutation_class="cancer_example",
    download_cosmic=True,
)

11:47:13 - INFO - NOTE: Licence fees apply for the commercial use of COSMIC.
11:47:17 - INFO - Extracted tar file to data/reference/example_GRCh37
11:47:17 - INFO - Creating modified mutations file for use with gget mutate...
11:47:17 - INFO - Modified mutations file for use with gget mutate created at data/reference/example_GRCh37/CancerMutationCensus_AllData_v101_GRCh37_mutation_workflow.csv


### Print the first few lines of the COSMIC file
Note that the sequence IDs are in the column `seq_ID` and the variants are in the column `mutation`.

In [19]:
!head -n 10 {cosmic_file_path}

seq_ID,mutation,mutation_aa,gene_name,mutation_id
ENST00000275493,c.650A>T,p.Q217L,EGFR,22513728
ENST00000275493,c.966C>T,p.G322=,EGFR,22493275
ENST00000275493,c.3458G>T,p.S1153I,EGFR,22496821
ENST00000275493,c.2239_2262del,p.L747_K754del,EGFR,22488435
ENST00000275493,c.3506A>G,p.N1169S,EGFR,22555598
ENST00000275493,c.2236_2248delinsCAAC,p.E746_A750delinsQP,EGFR,22504289
ENST00000275493,c.3354G>A,p.A1118=,EGFR,22533818
ENST00000275493,c.2317_2319dup,p.H773dup,EGFR,22487477
ENST00000275493,c.439G>A,p.A147T,EGFR,22498328
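
The `mutation` column above mixes several kinds of variants (substitutions, deletions, duplications, and deletion-insertions). As a rough illustration of what that column contains, the sketch below classifies a few of the rows shown above using regular expressions. The function names and patterns are hypothetical helpers written for this example; they cover only the notations visible in the sample rows and are not part of varseek or gget.

```python
import csv
import io
import re
from collections import Counter

# A few rows copied from the COSMIC sample output above.
SAMPLE = """seq_ID,mutation,mutation_aa,gene_name,mutation_id
ENST00000275493,c.650A>T,p.Q217L,EGFR,22513728
ENST00000275493,c.2239_2262del,p.L747_K754del,EGFR,22488435
ENST00000275493,c.2236_2248delinsCAAC,p.E746_A750delinsQP,EGFR,22504289
ENST00000275493,c.2317_2319dup,p.H773dup,EGFR,22487477
"""

# Hypothetical, simplified patterns: each one is anchored at both ends,
# so a string can match at most one of them.
PATTERNS = [
    ("substitution", re.compile(r"^c\.\d+[ACGT]>[ACGT]$")),
    ("delins", re.compile(r"^c\.\d+(_\d+)?delins[ACGT]+$")),
    ("deletion", re.compile(r"^c\.\d+(_\d+)?del$")),
    ("duplication", re.compile(r"^c\.\d+(_\d+)?dup$")),
    ("insertion", re.compile(r"^c\.\d+_\d+ins[ACGT]+$")),
]


def classify_variant(mutation):
    """Return a coarse variant type for a coding-sequence variant string."""
    for label, pattern in PATTERNS:
        if pattern.match(mutation):
            return label
    return "other"


def count_variant_types(csv_text, var_column="mutation"):
    """Count variant types in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(classify_variant(row[var_column]) for row in reader)


print(count_variant_types(SAMPLE))
```

To run this on the full file instead of the embedded sample, the text could be read from `cosmic_file_path` and passed to `count_variant_types`.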


### Download the reference CDS file (GRCh37, Ensembl release 93)

In [22]:
!gget ref -w cds -r 93 --out_dir {reference_dir} -d human_grch37
!gunzip {cds_file_path}.gz

11:50:03 - INFO - Fetching reference information for homo_sapiens from Ensembl release: 93.
{
    "homo_sapiens": {
        "coding_seq_cds": {
            "ftp": "http://ftp.ensembl.org/pub/grch37/release-93/fasta/homo_sapiens/cds/Homo_sapiens.GRCh37.cds.all.fa.gz",
            "ensembl_release": 93,
            "release_date": "2015-11-27",
            "release_time": "20:17",
            "bytes": "19M"
        }
    }
}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19.0M  100 19.0M    0     0  8067k      0  0:00:02  0:00:02 --:--:-- 8070k
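
The `gunzip` shell command used above may not be available in every environment (for example, on Windows). A minimal Python alternative, using only the standard library, could look like the following. The helper name `gunzip_file` is made up for this sketch.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, keep_original=False):
    """Decompress a .gz file next to itself and return the output path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"Expected a .gz file, got {gz_path}")
    out_path = gz_path.with_suffix("")  # drop only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files are not held in memory
    if not keep_original:
        gz_path.unlink()  # mirror gunzip, which removes the compressed file
    return out_path


# Example call in this notebook (assumes cds_file_path is defined as above):
# gunzip_file(f"{cds_file_path}.gz")
```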


### Run varseek ref

`vk.ref` runs the following commands in order:
- `varseek build`: Creates a variant-containing reference sequence (VCRS) fasta file, where each VCRS contains the variant and at most *w* base pairs of flanking sequence on each side (optimized so that no k-mer is shared between a VCRS and its non-variant counterpart in the reference genome). Also creates a t2g file for compatibility with varseek count.
- `varseek info`: Collects additional information about the VCRSs to be used for filtering. By default, it collects information on k-mer alignment between each VCRS and the T2T reference genome + transcriptome, VCRS pseudoalignment to the T2T reference genome (also indicating shared k-mers), and VCRS sequence complexity (i.e., the number of unique nucleotide triplets divided by the sequence length).
- `varseek filter`: Filters the VCRS fasta file and t2g file. By default, it filters out any VCRS that shares a k-mer with any region of the reference genome + transcriptome, any VCRS that pseudoaligns to the reference genome, and any VCRS with fewer than three unique triplets.
- `kb ref`: Creates a kb index file from the filtered VCRS fasta file that can be used in varseek count.
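
To make the ideas in the list above more concrete, here is a toy sketch of three of them: building a VCRS for a single-base substitution with at most *w* flanking bases, checking whether two sequences share a k-mer, and computing the triplet complexity as defined above. This is a simplified illustration under stated assumptions, not varseek's implementation: the function names are hypothetical, only substitutions are handled, the flank optimization is not reproduced, and the sequence is invented.

```python
def build_substitution_vcrs(seq, pos, ref, alt, w):
    """Return the variant base plus up to w flanking bases on each side.

    pos is 1-based, as in a variant such as c.650A>T.
    """
    idx = pos - 1
    if seq[idx] != ref:
        raise ValueError(f"Expected {ref} at position {pos}, found {seq[idx]}")
    start = max(0, idx - w)            # flank is shorter near the sequence start
    end = min(len(seq), idx + w + 1)   # and near the sequence end
    return seq[start:idx] + alt + seq[idx + 1:end]


def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(a, b, k):
    """True if sequences a and b have at least one k-mer in common."""
    return not kmers(a, k).isdisjoint(kmers(b, k))


def triplet_complexity(seq):
    """Number of unique nucleotide triplets divided by the sequence length."""
    return len(kmers(seq, 3)) / len(seq)


# Invented toy sequence and variant (c.7A>T), with k = 4 and w = k - 1 = 3.
reference = "ATGGCCAAGTTCGA"
k = 4
vcrs = build_substitution_vcrs(reference, pos=7, ref="A", alt="T", w=k - 1)

print(vcrs)
print(shares_kmer(vcrs, reference, k))
print(triplet_complexity(vcrs))
```

In this toy case, choosing *w* = *k* - 1 means every k-mer of the VCRS overlaps the substituted base, which is why none of them appears in the unmodified sequence. Real sequences can still contain repeats that produce shared k-mers elsewhere, which is the situation the filtering step is described as handling.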

In [4]:
vk_ref_output_dict = vk.ref(
    variants=cosmic_file_path,
    sequences=cds_file_path,
    seq_id_column="seq_ID",
    var_column="mutation",
    out=vk_ref_out_dir,
    reference_out_dir=reference_dir,
)

00:19:00 - INFO - Using COSMIC email from COSMIC_EMAIL environment variable: jmrich@caltech.edu
00:19:00 - INFO - Using COSMIC password from COSMIC_PASSWORD environment variable
00:19:00 - INFO - Running kb ref with command: kb ref --workflow custom -t 2 -i data/varseek_ref_out/vcrs_index.idx --d-list None -k 59 --overwrite data/varseek_ref_out/vcrs_filtered.fa
[2025-02-28 00:19:03,265]    INFO [ref_custom] Indexing data/varseek_ref_out/vcrs_filtered.fa to data/varseek_ref_out/vcrs_index.idx
[2025-02-28 00:19:04,688]    INFO [ref_custom] Finished creating custom index
00:19:04 - INFO - Total runtime for vk ref: 0m, 4.65s


In [6]:
print(f"Find index at {vk_ref_output_dict['index']}")
print(f"Find t2g at {vk_ref_output_dict['t2g']}")

Find index at data/varseek_ref_out/vcrs_index.idx
Find t2g at data/varseek_ref_out/vcrs_t2g_filtered.txt
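
Before passing these files to a downstream step, it can be useful to confirm that they exist and are not empty. The helper below is a hypothetical convenience function for this notebook, not part of varseek. It assumes only that the `index` and `t2g` entries of the returned dictionary are file paths, as the printed output above suggests.

```python
import os


def missing_outputs(output_paths):
    """Return the keys whose path is not an existing, non-empty file."""
    missing = []
    for name, path in output_paths.items():
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# Example call in this notebook (assumes vk_ref_output_dict is defined as above):
# missing_outputs({key: vk_ref_output_dict[key] for key in ("index", "t2g")})
```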
