## `aenmd` installation



#### Step 1: Install the `remotes` package.

In [9]:
#- install remotes only if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}



#### Step 2: Install an aenmd data package.

Here we use `aenmd.data.ensdb.v105`, but other data packages work equally well.

In [10]:
remotes::install_github(repo = "kostkalab/aenmd_data",
                          subdir = "aenmd.data.ensdb.v105")



Skipping install of 'aenmd.data.ensdb.v105' from a github remote, the SHA1 (fea55c8a) has not changed since last install.
  Use `force = TRUE` to force installation



#### Step 3: Install the `aenmd` package

In [11]:
remotes::install_github("kostkalab/aenmd")


Skipping install of 'aenmd' from a github remote, the SHA1 (a8f8a992) has not changed since last install.
  Use `force = TRUE` to force installation
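
As a quick sanity check after the three installation steps, the sketch below reports which of the packages can be loaded from the current R library. `check_installed` is a hypothetical helper written for this illustration; it is not part of `aenmd` or `remotes` and relies only on base R's `requireNamespace`.

```r
# Hypothetical helper (not part of aenmd or remotes): for each package name,
# report whether its namespace can be loaded from the current library.
check_installed <- function(pkgs) {
    vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

# Package names are the ones used in Steps 1 to 3 above.
status <- check_installed(c("remotes", "aenmd.data.ensdb.v105", "aenmd"))
print(status)
```

Any entry reported as `FALSE` points at the step that should be repeated, if needed with `force = TRUE`.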



## `aenmd` test drive (example)

### Load `vcf` file
The package comes with ~1k randomly selected ClinVar variants, which we will annotate here. First, we read in the `vcf` file.

In [18]:
library(aenmd) #- automatically loads annotation data as well

#- load variants to annotate (1,000 random ClinVar variants)
vcf_file <- system.file('extdata/clinvar_20221211_noinfo_sample1k.vcf.gz', package = 'aenmd')
vcf      <- aenmd:::parse_vcf_VariantAnnotation(vcf_file)
vcf_rng  <- vcf$vcf_rng
vcf_rng[1:3]

Reading in vcf file using VariantAnnotation::readVcf...
 done.

 done.



GRanges object with 3 ranges and 5 metadata columns:
      seqnames        ranges strand | param_range_id                     ref
         <Rle>     <IRanges>  <Rle> |       <factor>          <DNAStringSet>
  [1]        1 940501-941150      * |             NA GGAGCCTGCA...CAGATCTCCT
  [2]        1 942504-942505      * |             NA                      CG
  [3]        1       1041417      * |             NA                       C
                 alt      qual      filter
      <DNAStringSet> <numeric> <character>
  [1]              G        NA           .
  [2]              C        NA           .
  [3]              T        NA           .
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths

#### Variant processing

Next, we filter out variants that will not be assessed (splice variants) and variants that do not create premature termination codons (PTCs).

In [19]:
vcf_rng_fil <- process_variants(vcf_rng)
vcf_rng_fil[1:3]

Processing variants.

Filtering out splice variants.

Filtering out variants that are not in coding sequence.

Filtering out snvs that don't create stop codons.



GRanges object with 3 ranges and 7 metadata columns:
      seqnames    ranges strand | param_range_id            ref            alt
         <Rle> <IRanges>  <Rle> |       <factor> <DNAStringSet> <DNAStringSet>
  [1]        1  12011551      * |             NA              C              T
  [2]        1  99902707      * |             NA              C              T
  [3]        1 115727014      * |             NA              C              A
           qual      filter        type             key
      <numeric> <character> <character>     <character>
  [1]        NA           .         snv 1:012011551|C|T
  [2]        NA           .         snv 1:099902707|C|T
  [3]        NA           .         snv 1:115727014|C|A
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths
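
The `key` column in the output above appears to encode each variant as chromosome, zero-padded position, reference allele, and alternative allele. The sketch below rebuilds keys of that shape in base R. `make_variant_key` is a hypothetical helper, and the nine-digit padding is an assumption inferred only from the printed output, not from the `aenmd` source.

```r
# Hypothetical helper: build a key such as "1:012011551|C|T".
# Assumption (inferred from the printed output only): the position is
# zero-padded to 9 digits and the fields are separated by ":" and "|".
make_variant_key <- function(chrom, pos, ref, alt) {
    sprintf("%s:%09d|%s|%s", chrom, as.integer(pos), ref, alt)
}

# The three variants shown in the output above.
keys <- make_variant_key(
    chrom = c("1", "1", "1"),
    pos   = c(12011551, 99902707, 115727014),
    ref   = c("C", "C", "C"),
    alt   = c("T", "T", "A")
)
print(keys)
```

Keys of this form are convenient for matching annotated variants back to rows of the input `vcf`.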

#### Variant annotation
Finally, we annotate variants with respect to NMD escape.
Here, we collect a `GRanges` object.

In [20]:
vcf_rng_ann <- annotate_nmd(vcf_rng_fil, rettype="gr")
vcf_rng_ann[1:3]

GRanges object with 3 ranges and 8 metadata columns:
                                  seqnames    ranges strand | param_range_id
                                     <Rle> <IRanges>  <Rle> |       <factor>
  ENST00000235329|1:012011551|C|T        1  12011551      * |             NA
  ENST00000294724|1:099902707|C|T        1  99902707      * |             NA
  ENST00000361915|1:099902707|C|T        1  99902707      * |             NA
                                             ref            alt      qual
                                  <DNAStringSet> <DNAStringSet> <numeric>
  ENST00000235329|1:012011551|C|T              C              T        NA
  ENST00000294724|1:099902707|C|T              C              T        NA
  ENST00000361915|1:099902707|C|T              C              T        NA
                                       filter        type             key
                                  <character> <character>     <character>
  ENST00000235329|1:012011551|C|T           
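
In the output above, each range is named by a transcript ID followed by the variant key, which is why one variant can appear once per overlapping transcript. The sketch below splits such names into their two parts with base R. `split_pair_name` is a hypothetical helper, and the `transcript|key` layout is an assumption read off the printed names.

```r
# Hypothetical helper: split names of the form "<transcript>|<variant key>".
# Assumption: the transcript ID is everything before the first "|".
split_pair_name <- function(nms) {
    data.frame(
        transcript = sub("\\|.*$", "", nms),    # drop first "|" and the rest
        key        = sub("^[^|]*\\|", "", nms), # drop up to the first "|"
        stringsAsFactors = FALSE
    )
}

# Names copied from the output above.
nms <- c("ENST00000235329|1:012011551|C|T",
         "ENST00000294724|1:099902707|C|T",
         "ENST00000361915|1:099902707|C|T")
pairs <- split_pair_name(nms)
print(pairs)
```

With the annotated object at hand, the same helper could presumably be applied to `names(vcf_rng_ann)`.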

#### Access `aenmd`'s annotations
We can access NMD escape annotations in a `DataFrame`, with rows corresponding to variant x transcript pairs encoded by the ranges in `vcf_rng_ann`.

In [21]:
vcf_rng_ann$res_aenmd


DataFrame with 142 rows and 7 columns
       is_ptc   is_last is_penultimate is_cssProximal is_single is_407plus
    <logical> <logical>      <logical>      <logical> <logical>  <logical>
1        TRUE      TRUE          FALSE          FALSE     FALSE      FALSE
2        TRUE     FALSE          FALSE          FALSE     FALSE      FALSE
3        TRUE     FALSE          FALSE          FALSE     FALSE      FALSE
4        TRUE     FALSE          FALSE          FALSE     FALSE      FALSE
5        TRUE     FALSE          FALSE          FALSE     FALSE      FALSE
...       ...       ...            ...            ...       ...        ...
138      TRUE     FALSE          FALSE          FALSE     FALSE      FALSE
139      TRUE     FALSE          FALSE          FALSE     FALSE      FALSE
140      TRUE     FALSE          FALSE          FALSE     FALSE      FALSE
141     FALSE     FALSE          FALSE          FALSE     FALSE      FALSE
142     FALSE     FALSE          FALSE          FALSE     FALS