# ENCODE Imputation Main Coding Notebook

## Summary
The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute. The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. Accordingly, a major activity of data production labs within ENCODE is the generation of data sets that enable characterization of various types of biochemical activity (transcription, histone modifications, chromatin accessibility, transcription factor binding, etc.) along the human genome.

Performing the assays that measure these genomic features is expensive, and technical challenges may prevent complete characterization of particular cell types, so computational methods capable of predicting the outcome of these assays are potentially valuable. The goal of the ENCODE Imputation Challenge is to empirically compare methods for imputing data produced by various types of genomics assays. The challenge will be carried out in parallel with ENCODE’s ongoing data generation efforts, thereby allowing truly prospective validation of methods on newly acquired data sets.

In [4]:
# Loading modules
import synapseclient as syn
import pyBigWig as bw
import twobitreader as tbr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [5]:
# Playing around with BigWig
td_1 = bw.open('../data/encode_challenge_dat/training_data/C01M16.bigwig')
print(td_1.isBigWig())
print(td_1.isBigBed())

True
False


In [6]:
# Get chromosome names and their lengths (in bases)
td_1.chroms()

{'chr1': 248956422,
 'chr10': 133797422,
 'chr11': 135086622,
 'chr12': 133275309,
 'chr13': 114364328,
 'chr14': 107043718,
 'chr14_GL000009v2_random': 201709,
 'chr14_GL000194v1_random': 191469,
 'chr14_GL000225v1_random': 211173,
 'chr14_KI270722v1_random': 194050,
 'chr14_KI270723v1_random': 38115,
 'chr14_KI270724v1_random': 39555,
 'chr14_KI270725v1_random': 172810,
 'chr14_KI270726v1_random': 43739,
 'chr15': 101991189,
 'chr15_KI270727v1_random': 448248,
 'chr16': 90338345,
 'chr16_KI270728v1_random': 1872759,
 'chr17': 83257441,
 'chr17_GL000205v2_random': 185591,
 'chr17_KI270729v1_random': 280839,
 'chr17_KI270730v1_random': 112551,
 'chr18': 80373285,
 'chr19': 58617616,
 'chr1_KI270706v1_random': 175055,
 'chr1_KI270707v1_random': 32032,
 'chr1_KI270708v1_random': 127682,
 'chr1_KI270709v1_random': 66860,
 'chr1_KI270710v1_random': 40176,
 'chr1_KI270711v1_random': 42210,
 'chr1_KI270712v1_random': 176043,
 'chr1_KI270713v1_random': 40745,
 'chr1_KI270714v1_random': 41717,
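
The listing above mixes the primary assembled chromosomes with unplaced `_random` scaffolds. The following is a minimal sketch of one way to keep only the primary chromosomes, assuming the names follow the UCSC `chrN`/`chrX`/`chrY` pattern. The helper name `primary_chroms` is hypothetical and not part of pyBigWig.

```python
import re

# Assumption: primary chromosomes are named chr1..chr22, chrX, chrY (UCSC style).
# This pattern also drops chrM and any *_random or chrUn_* scaffolds.
PRIMARY_CHROM = re.compile(r"^chr(\d+|X|Y)$")

def primary_chroms(chrom_sizes):
    """Return only the primary chromosomes from a {name: length} dict."""
    return {
        name: size
        for name, size in chrom_sizes.items()
        if PRIMARY_CHROM.match(name)
    }

# A few entries copied from the td_1.chroms() output above.
example = {
    "chr1": 248956422,
    "chr14": 107043718,
    "chr14_GL000009v2_random": 201709,
    "chr1_KI270706v1_random": 175055,
}
print(primary_chroms(example))
```

With the open file, this would be called as `primary_chroms(td_1.chroms())`.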

In [12]:
# Get the length of a specific chromosome
print(td_1.chroms('chr1'))

248956422


In [11]:
#Get header for bigwig file
td_1.header()

{'version': 4,
 'nLevels': 10,
 'nBasesCovered': 3096013814,
 'minVal': 0,
 'maxVal': 54,
 'sumData': 1119160580,
 'sumSquared': 1119742727}
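
The header fields are enough to estimate genome-wide summary statistics without reading any signal values. The sketch below derives a mean and a population standard deviation from `sumData`, `sumSquared`, and `nBasesCovered`. The helper name `header_summary` is hypothetical, and because the header values shown above are integers, the result should be treated as approximate.

```python
import math

def header_summary(header):
    """Estimate genome-wide mean and population std from a bigWig header dict."""
    n = header["nBasesCovered"]
    mean = header["sumData"] / n
    # Population variance: E[x^2] - (E[x])^2, clamped at 0 to guard against rounding.
    variance = header["sumSquared"] / n - mean ** 2
    return mean, math.sqrt(max(variance, 0.0))

# Values copied from the td_1.header() output above.
header = {
    "version": 4,
    "nLevels": 10,
    "nBasesCovered": 3096013814,
    "minVal": 0,
    "maxVal": 54,
    "sumData": 1119160580,
    "sumSquared": 1119742727,
}
mean, std = header_summary(header)
print(f"genome-wide mean = {mean:.4f}, std = {std:.4f}")
```

With the open file, this would be called as `header_summary(td_1.header())`.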

In [14]:
# Get basic stats on the bigWig file (stats() returns the mean over the interval by default)
td_1.stats('chr1', 1, 30000)

[0.23049324170436863]
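
A single mean over 30 kb hides most of the signal's shape, so a binned profile is often more useful. The sketch below averages per-base values in fixed-width bins while ignoring missing (`nan`) positions. The helper name `binned_means` and the 25 bp default bin width are illustrative assumptions, not something this notebook or pyBigWig prescribes.

```python
import math

def binned_means(values, bin_size=25):
    """Average per-base values in consecutive bins of `bin_size` bases.

    NaN entries (positions without data) are skipped; a bin with no data
    yields NaN. A trailing partial bin is dropped.
    """
    means = []
    for start in range(0, len(values) - bin_size + 1, bin_size):
        window = [v for v in values[start:start + bin_size] if not math.isnan(v)]
        means.append(sum(window) / len(window) if window else math.nan)
    return means

# Synthetic per-base signal standing in for real bigWig values.
signal = [1.0, 1.0, 4.0, math.nan, 5.0, math.nan,
          math.nan, math.nan, math.nan, 7.0]
profile = binned_means(signal, bin_size=3)
print(profile)
```

On the real file the input could come from `td_1.values('chr1', 0, 30000)`, which returns one float per base. pyBigWig intervals are 0-based and half-open, so a start of 0 addresses the first base of the chromosome. pyBigWig's own `stats()` also accepts an `nBins` argument, which may be a simpler route to a similar profile.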

In [15]:
# Load the hg38 reference genome from a 2bit file
ref_gen = tbr.TwoBitFile('../data/hg38.2bit')
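
Before combining signal tracks with the reference sequence, it can help to confirm that both use the same assembly. The sketch below compares two `{name: length}` dictionaries and reports any disagreement. The helper name `compare_chrom_sizes` is hypothetical, and the suggested call assumes twobitreader's `TwoBitFile.sequence_sizes()` returns such a dictionary.

```python
def compare_chrom_sizes(signal_sizes, reference_sizes):
    """Return {name: (signal_len, reference_len)} for chromosomes that disagree.

    A chromosome missing from the reference is reported with None as its
    reference length.
    """
    mismatches = {}
    for name, size in signal_sizes.items():
        ref_size = reference_sizes.get(name)
        if ref_size != size:
            mismatches[name] = (size, ref_size)
    return mismatches

# Small hand-made dictionaries standing in for the real files;
# the "chr2" length in `signal` is a made-up value for illustration.
signal = {"chr1": 248956422, "chr2": 1000, "chrZ": 5}
reference = {"chr1": 248956422, "chr2": 999}
print(compare_chrom_sizes(signal, reference))
```

With the objects loaded in this notebook, the call would be `compare_chrom_sizes(td_1.chroms(), ref_gen.sequence_sizes())`. An empty result would suggest that the bigWig file and the 2bit file describe the same assembly.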