# ENCODE on Azure Genomics Data Lake

Jupyter notebooks are a convenient tool for data scientists working on genomics data analysis. This notebook demonstrates how to use Encyclopedia of DNA Elements (ENCODE) data from Azure Open Datasets.

**This notebook covers:**

1. Getting the ENCODE data from Azure Open Datasets
2. Importing 'encode_file_manifest.tsv' into a table
3. Checking the count of specific files

**Dependencies:**

This notebook requires the following libraries:

- Azure Storage Blob SDK: `pip install azure-storage-blob==2.1.0`. Please visit [this page](https://github.com/Azure/azure-storage-python/wiki) for frequently encountered problems with this SDK.


- Technical note: [Explore Azure Genomics Data Lake with Azure Storage Explorer](https://github.com/microsoft/genomicsnotebook/blob/main/docs/Genomics_Data_Lake_Azure_Storage_Explorer.pdf)

**Important information: This notebook uses the Python 3.6 kernel**


# 1. Getting the ENCODE data from Azure Open Datasets

Several public genomics datasets have been uploaded as Azure Open Datasets [here](https://azure.microsoft.com/services/open-datasets/catalog/). We create a blob service client linked to this open dataset. The cells below show an example of retrieving `ENCODE` data from Azure Open Datasets:

**1.a. Install the Azure Blob Storage SDK**

In [None]:
pip install azure-storage-blob==2.1.0

**1.b. Download the target file**

In [None]:
from azure.storage.blob import BlockBlobService

# Connect to the ENCODE storage account with the dataset's SAS token.
blob_service_client = BlockBlobService(
    account_name='datasetencode',
    sas_token='?sv=2019-10-10&si=prod&sr=c&sig=9qSQZo4ggrCNpybBExU8SypuUZV33igI11xw0P7rB3c%3D'
)

# Download the file manifest from the 'dataset' container to the working directory.
blob_service_client.get_blob_to_path('dataset', 'encode_file_manifest.tsv', './encode_file_manifest.tsv')
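
Before loading the manifest, it can help to confirm that the download produced a non-empty, tab-separated file. The sketch below uses only the standard library. `peek_tsv` is a hypothetical helper name (not part of the Azure SDK), and the code assumes that the first line of the file is a header row.

```python
import csv
import os


def peek_tsv(path, n_rows=3):
    """Return (header, first n_rows data rows) of a tab-separated file."""
    if os.path.getsize(path) == 0:
        raise ValueError("{} is empty".format(path))
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)  # assumption: the first line holds column names
        # zip stops after n_rows rows, or earlier if the file is shorter
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    # Assumes the manifest was downloaded to the current directory in step 1.b.
    manifest_path = "./encode_file_manifest.tsv"
    if os.path.exists(manifest_path):
        header, rows = peek_tsv(manifest_path)
        print("{} columns; first columns: {}".format(len(header), header[:5]))
    else:
        print("Manifest not found; run the download cell first.")
```

Reading only a few lines this way avoids parsing the whole manifest just to check that the download succeeded.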

# 2. Importing 'encode_file_manifest.tsv' into a table

In [88]:
import pandas as pd

# Read encode_file_manifest.tsv into a DataFrame
metadata = pd.read_table('encode_file_manifest.tsv', sep='\t')

# Preview a subset of the metadata columns, selected by position
metadata.iloc[:, [1, 2, 3, 4, 5, 10, 11, 12]]

| | status | file_format | file_type | assembly | award.rfa | output_type | output_category | file_size |
|---|---|---|---|---|---|---|---|---|
| 0 | released | bigWig | bigWig | GRCh38 | ENCODE | signal p-value | signal | 6.206849e+08 |
| 1 | released | bigWig | bigWig | GRCh38 | ENCODE | plus strand signal of all reads | signal | 6.236199e+08 |
| 2 | released | bigWig | bigWig | GRCh38 | ENCODE | signal p-value | signal | 6.222111e+08 |
| 3 | released | bigWig | bigWig | GRCh38 | ENCODE | signal p-value | signal | 6.442427e+08 |
| 4 | released | bigWig | bigWig | GRCh38 | ENCODE | signal p-value | signal | 6.222841e+08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 641541 | released | bigWig | bigWig | hg19 | ENCODE2 | summed densities signal | signal | 1.309956e+07 |
| 641542 | released | bigWig | bigWig | hg19 | ENCODE2 | wavelet-smoothed signal | signal | 1.015879e+07 |
| 641543 | released | bigWig | bigWig | hg19 | ENCODE2 | wavelet-smoothed signal | signal | 1.021096e+07 |
| 641544 | released | bigWig | bigWig | hg19 | ENCODE2 | signal | signal | 1.781798e+10 |
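
Positional selection with `iloc` depends on the column order of the manifest. Below is a minimal sketch of the same preview using column names instead, assuming the manifest columns carry the names shown in the preview above. A tiny in-memory TSV stands in for the real file so the snippet runs on its own; its two rows are copied from the preview, and `extra_column` is a hypothetical placeholder for the columns that are not displayed.

```python
import io

import pandas as pd

# Column names as they appear in the preview above.
PREVIEW_COLUMNS = [
    "status", "file_format", "file_type", "assembly",
    "award.rfa", "output_type", "output_category", "file_size",
]

# Stand-in for encode_file_manifest.tsv (extra_column is a made-up placeholder).
sample_tsv = (
    "extra_column\tstatus\tfile_format\tfile_type\tassembly\t"
    "award.rfa\toutput_type\toutput_category\tfile_size\n"
    "x\treleased\tbigWig\tbigWig\tGRCh38\tENCODE\tsignal p-value\tsignal\t6.206849e+08\n"
    "y\treleased\tbigWig\tbigWig\thg19\tENCODE2\twavelet-smoothed signal\tsignal\t1.015879e+07\n"
)

# usecols restricts parsing to the named columns, which also reduces memory use.
preview = pd.read_csv(io.StringIO(sample_tsv), sep="\t", usecols=PREVIEW_COLUMNS)

# usecols does not control column order, so reorder explicitly.
preview = preview[PREVIEW_COLUMNS]
print(preview)
```

With the real manifest, the `io.StringIO(sample_tsv)` argument would be replaced by the path `'encode_file_manifest.tsv'`.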


# 3. Checking the count of specific files

In [94]:
# Take a quick look at how the files are distributed

num_entries = len(metadata)
print("There are {} files in this dataset".format(num_entries))

# Count files by award.rfa
num_ENCODE = metadata['award.rfa'].eq('ENCODE').sum()
print("There are {} ENCODE award.rfa files in this dataset".format(num_ENCODE))

num_ENCODE2 = metadata['award.rfa'].eq('ENCODE2').sum()
print("There are {} ENCODE2 award.rfa files in this dataset".format(num_ENCODE2))

# Count files by genome assembly
num_hg19 = metadata['assembly'].eq('hg19').sum()
print("There are {} hg19 assembled files in this dataset".format(num_hg19))

num_GRCh38 = metadata['assembly'].eq('GRCh38').sum()
print("There are {} GRCh38 assembled files in this dataset".format(num_GRCh38))

# Count files by output type
num_signal = metadata['output_type'].eq('signal p-value').sum()
print("There are {} signal p-value output type files in this dataset".format(num_signal))

num_wavelet = metadata['output_type'].eq('wavelet-smoothed signal').sum()
print("There are {} wavelet-smoothed signal output type files in this dataset".format(num_wavelet))

num_density = metadata['output_type'].eq('summed densities signal').sum()
print("There are {} summed densities signal output type files in this dataset".format(num_density))

There are 641546 files in this dataset
There are 41423 ENCODE award.rfa files in this dataset
There are 22487 ENCODE2 award.rfa files in this dataset
There are 163495 hg19 assembled files in this dataset
There are 281223 GRCh38 assembled files in this dataset
There are 74579 signal p-value output type files in this dataset
There are 142 wavelet-smoothed signal output type files in this dataset
There are 16 summed densities signal output type files in this dataset
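
The per-value counts above can also be obtained with one call per column using `Series.value_counts`, which avoids writing a separate line for every category. Below is a minimal sketch, using a small hand-made DataFrame in place of `metadata` so that it runs on its own. The rows are illustrative and are not real manifest entries, and `count_values` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for `metadata`; the values are illustrative only.
toy = pd.DataFrame({
    "award.rfa": ["ENCODE", "ENCODE", "ENCODE2", "ENCODE2"],
    "assembly": ["GRCh38", "GRCh38", "hg19", "hg19"],
    "output_type": [
        "signal p-value", "signal p-value",
        "wavelet-smoothed signal", "summed densities signal",
    ],
})


def count_values(frame, columns):
    """Return {column: {value: count}} for each requested column."""
    return {
        col: frame[col].value_counts(dropna=False).to_dict()
        for col in columns
    }


counts = count_values(toy, ["award.rfa", "assembly", "output_type"])
for col, by_value in counts.items():
    print(col, by_value)

# Two-way breakdown: number of files per (assembly, award.rfa) pair.
print(pd.crosstab(toy["assembly"], toy["award.rfa"]))
```

Passing `dropna=False` keeps missing values as their own category, which is useful when some manifest rows have no assembly or award recorded.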


# Reference

1. [ENCODE: Encyclopedia of DNA Elements](https://www.encodeproject.org)

