Kristen Abe

***Total: 42 points***

Complete this homework by writing R code to complete the following tasks. Keep in mind:

i. Empty chunks have been included where code is required
ii. This homework requires use of data files:

  - `BRCA.genome_wide_snp_6_broad_Level_3_scna.seg` (Problems 1, 2)
  - `GIAB_highconf_v.3.3.2.vcf.gz` (Problem 3)
  
iii. You will be graded on your code and output results. The assignment is worth 42 points total; partial credit can be awarded.

For additional resources, please refer to these links:  
Problems 1 & 2:  
  - https://www.bioconductor.org/packages/devel/bioc/vignettes/plyranges/inst/doc/an-introduction.html
  - https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html  
Problem 3:  
  - https://bioconductor.org/packages/release/bioc/vignettes/Rsamtools/inst/doc/Rsamtools-Overview.pdf  
  - https://bioconductor.org/packages/release/bioc/vignettes/VariantAnnotation/inst/doc/VariantAnnotation.pdf  

# Problem 1: Overlaps between genomic regions and copy number alterations. (14 points total)

### Preparation
Load copy number segment results as shown in *2.1 BED format* of *Lecture16_GenomicData.Rmd*. You will use the same file as in the lecture notes, `BRCA.genome_wide_snp_6_broad_Level_3_scna.seg`. Here is code to get you started.

In [159]:
#load packages
suppressPackageStartupMessages({
    library(tidyverse)
    library(GenomicRanges)
    library(plyranges)
    library(VariantAnnotation)
})

In [160]:
segs <- read.delim("BRCA.genome_wide_snp_6_broad_Level_3_scna.seg", as.is = TRUE)
mode(segs$Chromosome) <- "character" 
segs[segs$Chromosome == 23, "Chromosome"] <- "X"
segs.gr <- as(segs, "GRanges")
segs.gr

GRanges object with 284458 ranges and 3 metadata columns:
           seqnames              ranges strand |                 Sample
              <Rle>           <IRanges>  <Rle> |            <character>
       [1]        1    3218610-95674710      * | TCGA-3C-AAAU-10A-01D..
       [2]        1   95676511-95676518      * | TCGA-3C-AAAU-10A-01D..
       [3]        1  95680124-167057183      * | TCGA-3C-AAAU-10A-01D..
       [4]        1 167057495-167059336      * | TCGA-3C-AAAU-10A-01D..
       [5]        1 167059760-181602002      * | TCGA-3C-AAAU-10A-01D..
       ...      ...                 ...    ... .                    ...
  [284454]       19     284018-58878226      * | TCGA-Z7-A8R6-01A-11D..
  [284455]       20     455764-62219837      * | TCGA-Z7-A8R6-01A-11D..
  [284456]       21   15347621-47678774      * | TCGA-Z7-A8R6-01A-11D..
  [284457]       22   17423930-49331012      * | TCGA-Z7-A8R6-01A-11D..
  [284458]        X   3157107-154905589      * | TCGA-Z7-A8R6-01A-11D..
      

### a. Find the segments in `segs.gr` that have *any* overlap with the region `chr8:128,746,347-128,755,810` (4 points)
Print out the first five unique TCGA IDs.

In [161]:
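# Query region; seqnames use NCBI style ("8", not "chr8") to match segs.gr
chr8 <- GRanges(seqnames = "8", ranges = IRanges(start = 128746347, end = 128755810))

# Keep segments with any overlap, then report the first five unique sample IDs
segs.overlap <- find_overlaps(segs.gr, chr8)
head(unique(segs.overlap$Sample), 5)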
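
For intuition, the "any overlap" test can be sketched in base R without Bioconductor. This is only an illustration on a hypothetical toy table (the `toy` data frame and its sample names are made up, not taken from the BRCA file); it assumes closed, 1-based intervals, where two ranges on the same chromosome overlap when each one starts no later than the other ends.

```r
# Hypothetical toy segments (not from the BRCA file)
toy <- data.frame(
  chrom  = c("8", "8", "8", "17"),
  start  = c(100, 500, 900, 500),
  end    = c(400, 950, 1200, 950),
  Sample = c("S1", "S2", "S2", "S3"),
  stringsAsFactors = FALSE
)

# Hypothetical query region 8:450-1000
q_chrom <- "8"; q_start <- 450; q_end <- 1000

# Closed-interval overlap: same chromosome, and each range starts
# no later than the other one ends
is_hit <- toy$chrom == q_chrom & toy$start <= q_end & toy$end >= q_start

hits <- toy[is_hit, ]
unique(hits$Sample)
```

Because one sample can contribute several overlapping segments, `unique()` is what turns a list of hits into a list of sample IDs.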

### b. Find the mean of the `Segment_Mean` values for copy number segments that have *any* overlap with the region chr17:37,842,337-37,886,915. (4 points)

In [162]:
chr17 <- GRanges(seqnames = "17", ranges = IRanges(start = 37842337, end = 37886915))

segs.overlap2 <- find_overlaps(segs.gr, chr17)  
mean(segs.overlap2$Segment_Mean)

### c. Find the patient sample distribution of copy number for `PIK3CA` (hg19). (6 points)
Find the counts of samples with deletion (D; `Segment_Mean < -0.3`), neutral (N; `Segment_Mean >= -0.3 & Segment_Mean <= 0.3`), and gain (G; `Segment_Mean > 0.3`) segments that have *any* overlap with the `PIK3CA` gene coordinates.


In [163]:
PIK3CA <- GRanges(seqnames = "3", ranges = IRanges(start = 178866145, end = 178957881))
PIK3CA.overlap <- find_overlaps(segs.gr, PIK3CA)
seg.mean <- PIK3CA.overlap$Segment_Mean

# Deletion: strictly below -0.3
deletion <- seg.mean[seg.mean < -0.3]
paste("Deletion counts:", length(deletion))

# Neutral: both bounds must hold, so combine with & (not |)
neutral <- seg.mean[seg.mean >= -0.3 & seg.mean <= 0.3]
paste("Neutral counts:", length(neutral))

# Gain: strictly above 0.3
gain <- seg.mean[seg.mean > 0.3]
paste("Gain counts:", length(gain))
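
As a quick sanity check on the three thresholds, here is a minimal base R sketch that classifies a few made-up `Segment_Mean` values (the `seg_mean` vector is hypothetical). It shows that the boundary values -0.3 and 0.3 both fall in the neutral class and that the three classes do not overlap.

```r
# Hypothetical Segment_Mean values, including both boundary values
seg_mean <- c(-1.2, -0.3, 0, 0.3, 0.31, 0.8)

# D: < -0.3, G: > 0.3, everything else is N
cn_class <- ifelse(seg_mean < -0.3, "D",
            ifelse(seg_mean > 0.3, "G", "N"))

# Fix the level order so empty classes would still be reported as 0
cn_counts <- table(factor(cn_class, levels = c("D", "N", "G")))
cn_counts
```

The three counts sum to the number of input values, which is a useful check that no segment is counted twice.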

# Problem 2: Frequency of copy number alteration events within genomic regions. (12 points total) 

This problem will continue to use the copy number data stored in `segs.gr`.

### a. Create a genome-wide tile of 1Mb windows for the human genome (`hg19`). (6 points)
See *3.1 Tiling the genome* of *Lecture16_GenomicData.Rmd* for hints.


In [164]:
seqinfo <- Seqinfo(genome = "hg19")
seqinfo <- keepStandardChromosomes(seqinfo) 
seqlevelsStyle(seqinfo) <- "NCBI"
seqinfo

slen <- seqlengths(seqinfo) 
tileWidth <- 1000000
tiles <- tileGenome(seqlengths = slen, tilewidth = tileWidth,
                    cut.last.tile.in.chrom = TRUE)
tiles

“cannot switch some of hg19's seqlevels from UCSC to NCBI style”


Seqinfo object with 25 sequences (1 circular) from 2 genomes (GRCh37.p13, hg19):
  seqnames seqlengths isCircular     genome
  1         249250621      FALSE GRCh37.p13
  2         243199373      FALSE GRCh37.p13
  3         198022430      FALSE GRCh37.p13
  4         191154276      FALSE GRCh37.p13
  5         180915260      FALSE GRCh37.p13
  ...             ...        ...        ...
  21         48129895      FALSE GRCh37.p13
  22         51304566      FALSE GRCh37.p13
  X         155270560      FALSE GRCh37.p13
  Y          59373566      FALSE GRCh37.p13
  chrM          16571       TRUE       hg19

GRanges object with 3114 ranges and 0 metadata columns:
         seqnames            ranges strand
            <Rle>         <IRanges>  <Rle>
     [1]        1         1-1000000      *
     [2]        1   1000001-2000000      *
     [3]        1   2000001-3000000      *
     [4]        1   3000001-4000000      *
     [5]        1   4000001-5000000      *
     ...      ...               ...    ...
  [3110]        Y 56000001-57000000      *
  [3111]        Y 57000001-58000000      *
  [3112]        Y 58000001-59000000      *
  [3113]        Y 59000001-59373566      *
  [3114]     chrM           1-16571      *
  -------
  seqinfo: 25 sequences from an unspecified genome
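
To see what the tiling does on a single chromosome, the sketch below rebuilds fixed-width windows in base R. The helper `tile_chrom` is hypothetical (it is not part of GenomicRanges) and assumes 1-based closed intervals with the last window truncated at the chromosome end; it uses the chromosome Y length shown in the `Seqinfo` output above.

```r
# Hypothetical helper: fixed-width windows over one chromosome,
# with the last window cut at the chromosome end
tile_chrom <- function(chrom_length, tile_width = 1e6) {
  starts <- seq(1, chrom_length, by = tile_width)
  ends   <- pmin(starts + tile_width - 1, chrom_length)
  data.frame(start = starts, end = ends)
}

# Chromosome Y length taken from the Seqinfo output above
y_tiles <- tile_chrom(59373566)
nrow(y_tiles)
tail(y_tiles, 1)
```

The last row should agree with the final chromosome Y window printed in the `tiles` output.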

### b. Find the 1Mb window with the most frequent overlapping deletions. (6 points)
Find the 1Mb windows with `any` overlap with deletion copy number segments. Assume a deletion segment is defined as a segment in `segs.gr` having `Segment_Mean < -0.3`. 

Return one of the 1Mb window `GRanges` entries with the highest frequency (count) of deletion segments.

Hint: Subset the `segs.gr` to only rows with `Segment_Mean < -0.3`. 

In [165]:
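# Keep only deletion segments
deletions <- segs.gr[segs.gr$Segment_Mean < -0.3]

# Number of deletion segments overlapping each 1Mb window
deletions.overlap <- countOverlaps(tiles, deletions)

# which.max returns the first window holding the maximum count
top <- which.max(deletions.overlap)
tiles[top]
deletions.overlap[top]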
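
The counting step can be illustrated with a small base R sketch. The windows and deletion segments below are hypothetical and assumed to lie on a single chromosome, so only coordinates are compared; with real data the chromosome must match as well.

```r
# Hypothetical windows and deletion segments on one chromosome
windows <- data.frame(start = c(1, 101, 201), end = c(100, 200, 300))
dels    <- data.frame(start = c(50, 150, 180, 250),
                      end   = c(120, 160, 260, 290))

# For each window, count the deletion segments with any overlap
counts <- sapply(seq_len(nrow(windows)), function(i) {
  sum(dels$start <= windows$end[i] & dels$end >= windows$start[i])
})
counts

# Window with the highest count (first one if there are ties)
windows[which.max(counts), ]
```

A segment that spans a window boundary is counted in every window it touches, which is why the per-window counts can sum to more than the number of segments.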

# Problem 3: Reading and annotating genomic variants (16 points total)

### Preparation

In [166]:
vcfFile <- "GIAB_highconf_v.3.3.2.vcf.gz"

### a. Load variant data from VCF file `GIAB_highconf_v.3.3.2.vcf.gz` for `chr8:128,700,000-129,000,000`. (4 points)
Note: use genome build `hg19`.

In [167]:
vcfHead <- scanVcfHeader(vcfFile)
myGRange4 <- GRanges(seqnames = "8", ranges = IRanges(start = 128700000, end = 129000000))
vcf.param <- ScanVcfParam(which = myGRange4) 
vcf <- readVcf(vcfFile, genome = "hg19", param = vcf.param)
rowRanges(vcf)

GRanges object with 308 ranges and 5 metadata columns:
                                seqnames              ranges strand |
                                   <Rle>           <IRanges>  <Rle> |
                      rs6984323        8           128706908      * |
                      rs4478537        8           128708943      * |
                     rs34141920        8 128710237-128710239      * |
                     rs17772814        8           128711742      * |
                     rs77977256        8           128713029      * |
                            ...      ...                 ...    ... .
                     rs10808563        8           128996845      * |
                     rs71300287        8 128997083-128997091      * |
              8:128997155_CTT/C        8 128997155-128997157      * |
  8:128997161_TTCTTTCTCTTTCTC/T        8 128997161-128997175      * |
                      rs2392884        8           128999174      * |
                                par

### b. Combine the fields of the VCF genotype information into a table. (4 points)
You may use your choice of data objects (e.g. `data.frame`).

In [168]:
genoData <- bind_cols(lapply(geno(vcf), as.data.frame))
colnames(genoData) <- rownames(geno(header(vcf)))
genoData

[1m[22mNew names:
[36m•[39m `HG001` -> `HG001...1`
[36m•[39m `HG001` -> `HG001...2`
[36m•[39m `HG001` -> `HG001...3`
[36m•[39m `HG001` -> `HG001...4`
[36m•[39m `HG001` -> `HG001...5`
[36m•[39m `HG001` -> `HG001...6`
[36m•[39m `HG001` -> `HG001...7`
[36m•[39m `HG001` -> `HG001...8`


                 GT    DP    GQ    ADALL        AD           IGT   IPS   PS
                 <chr> <int> <int> <named list> <named list> <chr> <chr> <chr>
rs6984323        1|1   765   583   1, 332       0, 315       1/1   .     PATMAT
rs4478537        0|1   544   813   103, 124     135, 172     0/1   .     PATMAT
rs34141920       0|1   523   222   132, 121     132, 121     0/1   .     PATMAT
rs17772814       1|0   695   1503  143, 158     196, 199     0/1   .     PATMAT
rs77977256       1|0   642   685   154, 157     160, 166     0/1   .     PATMAT
8:128715845_AT/A 0|1   368   99    66, 91       66, 91       0/1   .     PATMAT
rs143209301      1|0   581   595   128, 128     151, 165     0/1   .     PATMAT
rs202231913      0|1   369   99    81, 97       81, 97       0/1   .     PATMAT
rs16902340       0|1   689   1294  144, 150     184, 204     0/1   .     PATMAT
rs7841229        0|1   635   1010  180, 172     134, 130     0/1   .     PATMAT
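
The "New names" message appears because every genotype field carries the same single sample column (`HG001`), so the columns collide when bound together. One way to sidestep the renaming, sketched here in base R on a hypothetical stand-in for `geno(vcf)` (the `geno_list` object, its variant names and its values are made up), is to pull the single sample column out of each field so that the list names become the column names. This assumes exactly one sample per field.

```r
# Hypothetical stand-in for geno(vcf): a named list of
# variant-by-sample matrices with a single sample column
geno_list <- list(
  GT = matrix(c("0|1", "1|1"), ncol = 1,
              dimnames = list(c("v1", "v2"), "S1")),
  DP = matrix(c(30L, 45L), ncol = 1,
              dimnames = list(c("v1", "v2"), "S1"))
)

# Take the one sample column from each field; the list names
# (GT, DP) become the data.frame column names
geno_table <- as.data.frame(lapply(geno_list, function(m) m[, 1]),
                            stringsAsFactors = FALSE)
rownames(geno_table) <- rownames(geno_list[[1]])
geno_table
```

Fields that hold list-valued entries (such as `AD` or `ADALL` in the real data) would need extra handling and are left out of this sketch.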


### c. Retrieve the following information at chr8:128747953. (8 points)
Print out the SNP ID (i.e. "rs ID"), reference base (`REF`), alternate base (`ALT`), genotype (`GT`), depth (`DP`), allele depth (`ADALL`), and phase set (`PS`).

Hints: 

  i. `REF` and `ALT` are in the output of `rowRanges(vcf)`. See Section `3a` in `Lecture16_VariantCalls.ipynb`  
  ii. To get the sequence of `DNAString`, use `as.character(x)`.  
  iii. To get the sequence of `DNAStringSet`, use `as.character(unlist(x))`.  
  iv. To expand a list of information for `geno`, use `unlist(x)`.  

  

In [169]:
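# Variant-level fields (REF, ALT, ...); row names hold the SNP IDs
variant_data <- as.data.frame(rowRanges(vcf))
idx <- which(variant_data$start == 128747953)
variant_data_match <- variant_data[idx, ]

# Genotype fields for the same variant (same row order as rowRanges(vcf))
genoData <- as.data.frame(genoData)
genoData_match <- genoData[idx, ]

combined_data <- bind_cols(variant_data_match, genoData_match)
# Select the requested columns by name rather than by position
answer <- combined_data[, c("REF", "ALT", "GT", "DP", "ADALL", "PS")]
answer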

          REF   ALT       GT    DP    ADALL        PS
          <chr> <I<list>> <chr> <int> <named list> <chr>
rs3824120 G     T         0|1   461   105, 94      PATMAT
