# Loading BGEN Files with Index Files

This notebook shows you how to load the BGEN files directly after the Hail Index files have been created (see previous notebook).

We assume:

- You have created your index files and they are in Project Storage (here in `/user/tladeras/index/`)
- You have created a file manifest that maps BGEN file locations in Bulk to the index file locations
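
For orientation, a manifest of this shape could be assembled roughly as follows. This is a minimal sketch, not necessarily how the previous notebook builds it: `build_manifest_rows` is a hypothetical helper, the `ukb23159_c{chrom}_b0_v1` naming pattern is inferred from the file names displayed later in this notebook, and the BGEN folder in the example call is a placeholder.

```python
def build_manifest_rows(bgen_dir, index_dir, chromosomes):
    """Hypothetical helper: return one manifest row (a dict) per chromosome."""
    rows = []
    for chrom in chromosomes:
        # Naming pattern assumed from the exome file names shown in this notebook.
        stem = f"ukb23159_c{chrom}_b0_v1"
        rows.append({
            "bgen": f"{bgen_dir}/{stem}.bgen",
            "sample": f"{bgen_dir}/{stem}.sample",
            "index_ps": f"{index_dir}/{stem}.bgen.idx2",
        })
    return rows


# Placeholder BGEN folder; substitute the real Bulk folder in your project.
rows = build_manifest_rows(
    bgen_dir="file:///mnt/project/Bulk/placeholder_bgen_folder",
    index_dir="file:///mnt/project/user/tladeras/index",
    chromosomes=[1, 4],
)
```

Passing `rows` to `pd.DataFrame(rows)` and writing the result with `to_csv(..., index=False)` would give a CSV with the `bgen`, `sample`, and `index_ps` columns that the rest of this notebook relies on.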

As always, we initialize Hail:

In [1]:
from pyspark.sql import SparkSession
import hail as hl
import hail.expr.aggregators as agg
import os

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
    .config("spark.shuffle.mapStatus.compression.codec", "lz4") 
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/cluster/dnax/jars/dnanexus-api-0.1.0-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/cluster/spark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]


2023-05-03 15:16:53.745 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-05-03 15:16:53.925 WARN  MetricsReporter:84 - No metrics configured for reporting
2023-05-03 15:16:53.927 WARN  LineProtoUsageReporter:48 - Telegraf configurations: url [metrics.push.telegraf.hostport], user [metrics.push.telegraf.user] or password [metrics.push.telegraf.password] missing.
2023-05-03 15:16:53.927 WARN  MetricsReporter:117 - metrics.scraping.httpserver.port


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2023-05-03 15:16:55.506 WARN  Utils:69 - Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 43000. Attempting port 43001.


pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.2.0
SparkUI available at http://ip-10-60-179-41.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.108-48fb3a9bae04
LOGGING: writing to /home/dnanexus/hail-20230503-1516-0.2.108-48fb3a9bae04.log


Now we can load our file manifest, which maps each BGEN file to the index file we created for it. Here we assume that the manifest is in `/user/tladeras/`.

In [5]:
import os

import dxpy
import pandas as pd

testing = True
# Project Storage folder that holds the manifest (mounted under /mnt/project)
project_dir = "/user/tladeras/"

manifest = pd.read_csv(f"file:///mnt/project{project_dir}bgen_manifest.csv")
manifest

Unnamed: 0.1,Unnamed: 0,bgen,sample,index,hdfs,index_ps
0,1,file:///mnt/project//Bulk/Exome sequences/Popu...,file:///mnt/project//Bulk/Exome sequences/Popu...,ukb23159_c11_b0_v1.bgen,hdfs:///index/ukb23159_c11_b0_v1.bgen.idx2,file:///mnt/project/user/tladeras/index/ukb231...
1,2,file:///mnt/project//Bulk/Exome sequences/Popu...,file:///mnt/project//Bulk/Exome sequences/Popu...,ukb23159_c1_b0_v1.bgen,hdfs:///index/ukb23159_c1_b0_v1.bgen.idx2,file:///mnt/project/user/tladeras/index/ukb231...
2,3,file:///mnt/project//Bulk/Exome sequences/Popu...,file:///mnt/project//Bulk/Exome sequences/Popu...,ukb23159_c4_b0_v1.bgen,hdfs:///index/ukb23159_c4_b0_v1.bgen.idx2,file:///mnt/project/user/tladeras/index/ukb231...
3,4,file:///mnt/project//Bulk/Exome sequences/Popu...,file:///mnt/project//Bulk/Exome sequences/Popu...,ukb23159_c16_b0_v1.bgen,hdfs:///index/ukb23159_c16_b0_v1.bgen.idx2,file:///mnt/project/user/tladeras/index/ukb231...


We recreate the mapping by iterating over the manifest row by row, and we extract the list of BGEN files from the manifest in the same way. We only need one `.sample` file, since they are all identical (and `hl.import_bgen()` only takes a single sample file).

In [6]:
# Map each BGEN path to the path of its index file in Project Storage
map_dict = {row["bgen"]: row["index_ps"] for _, row in manifest.iterrows()}
map_dict


{'file:///mnt/project//Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c11_b0_v1.bgen': 'file:///mnt/project/user/tladeras/index/ukb23159_c11_b0_v1.bgen.idx2',
 'file:///mnt/project//Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c1_b0_v1.bgen': 'file:///mnt/project/user/tladeras/index/ukb23159_c1_b0_v1.bgen.idx2',
 'file:///mnt/project//Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c4_b0_v1.bgen': 'file:///mnt/project/user/tladeras/index/ukb23159_c4_b0_v1.bgen.idx2',
 'file:///mnt/project//Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c16_b0_v1.bgen': 'file:///mnt/project/user/tladeras/index/ukb23159_c16_b0_v1.bgen.idx2'}
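
A mapping like the one above can be sanity-checked before it is handed to Hail. The sketch below assumes the naming convention visible in this output, where each index file is named after its BGEN file plus `.idx2`; that convention is this notebook's, not a Hail requirement. `find_mismatched_indexes` and `example_map` are hypothetical names introduced for illustration.

```python
import posixpath


def find_mismatched_indexes(index_map):
    """Return BGEN paths whose index file is not named '<bgen file name>.idx2'."""
    mismatched = []
    for bgen_path, index_path in index_map.items():
        expected_name = posixpath.basename(bgen_path) + ".idx2"
        if posixpath.basename(index_path) != expected_name:
            mismatched.append(bgen_path)
    return mismatched


# Hypothetical map with one deliberately wrong entry (c4 points at the c1 index).
example_map = {
    "file:///data/ukb23159_c1_b0_v1.bgen": "file:///index/ukb23159_c1_b0_v1.bgen.idx2",
    "file:///data/ukb23159_c4_b0_v1.bgen": "file:///index/ukb23159_c1_b0_v1.bgen.idx2",
}
bad = find_mismatched_indexes(example_map)
```

Here `bad` contains only the `c4` BGEN path. Calling `find_mismatched_indexes(map_dict)` on the real mapping should return an empty list if every BGEN file is paired with its own index.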

In [8]:
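# All BGEN paths, in manifest order
file_list = manifest["bgen"].tolist()

# Despite its name, sample_list holds a single path: the first .sample file
sample_list = manifest["sample"].tolist()[0]
sample_list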

'file:///mnt/project//Bulk/Exome sequences/Population level exome OQFE variants, BGEN format - final release/ukb23159_c11_b0_v1.sample'
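
The claim that the `.sample` files are identical can be checked rather than assumed. The following is a minimal sketch that compares SHA-256 digests; `local_path` and `sample_files_identical` are hypothetical helpers, and the demonstration runs on two temporary dummy files rather than on the real data.

```python
import hashlib
import os
import tempfile


def local_path(uri):
    """Drop a leading 'file://' so the path can be opened with open()."""
    prefix = "file://"
    return uri[len(prefix):] if uri.startswith(prefix) else uri


def sample_files_identical(sample_uris):
    """Return True if every listed file has the same SHA-256 digest."""
    digests = set()
    for uri in sample_uris:
        with open(local_path(uri), "rb") as handle:
            digests.add(hashlib.sha256(handle.read()).hexdigest())
    return len(digests) <= 1


# Self-contained demonstration on two temporary files with identical dummy content.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_uris = []
    for name in ("a.sample", "b.sample"):
        path = os.path.join(tmp_dir, name)
        with open(path, "w") as handle:
            handle.write("dummy sample content\n")
        demo_uris.append("file://" + path)
    demo_result = sample_files_identical(demo_uris)
```

In this notebook the equivalent call would be `sample_files_identical(manifest["sample"].tolist())`, assuming the `file:///mnt/project/...` paths are readable as ordinary local files from the driver.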

Now that our inputs are set up, we can load our BGEN files, run `mt.count()`, and do whatever other filtering, QC, or GWAS we need.

In [9]:
# Import all BGEN files, using the index files listed in map_dict
mt = hl.import_bgen(file_list,
                    entry_fields=['GT', 'GP'],
                    sample_file=sample_list,
                    n_partitions=None,
                    block_size=None,
                    variants=None,
                    index_file_map=map_dict)

2023-05-03 15:21:51.558 Hail: INFO: Number of BGEN files parsed: 4
2023-05-03 15:21:51.559 Hail: INFO: Number of samples in BGEN files: 469835
2023-05-03 15:21:51.559 Hail: INFO: Number of variants across all BGEN files: 6666112


In [10]:
mt.count()

(6666112, 469835)