# Vignette for the PyPI Package hapROH: Calling ROH from an Eigenstrat File
This notebook contains an example application of `hapROH`:
We will identify ROH using an Eigenstrat file as input. This notebook can serve as a blueprint for your own ROH inference.

You will see how to run `hapROH` on single chromosomes, on whole individuals, and on sets of individuals, and how to post-process multiple output files into one summary table (including metadata).

To learn how to visualize the results, take a look at `plotROH_vignette.ipynb`.

@author: Harald Ringbauer

In [1]:
### First some Standard Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sys
import multiprocessing as mp
print(f"CPU Count: {mp.cpu_count()}")

### If you want to use a version other than the default installed hapROH 
### uncomment the following and set the path to the correct (installed) package
#sys.path.insert(0,"/mnt/archgen/users/hringbauer/git/hapROH/package/")  # Uncomment to get local package first in path

CPU Count: 128


# Set the Path
You can set the path here to the desired working directory (relative paths below will be resolved relative to this folder). Alternatively, you can give absolute paths in the next steps.

In [2]:
### Fill in your own path here:
path = "/mnt/archgen/users/hringbauer/git/hapROH/Notebooks/Vignettes/"  # Path to the working directory

os.chdir(path)  # Change into the working directory
print(f"Set path to: {os.getcwd()}")  # Show the current working directory. Should be hapROH/Notebooks/Vignettes

Set path to: /mnt/archgen/users/hringbauer/git/hapROH/Notebooks/Vignettes


# 1) Download the data we will use as an example here
If you don't already have it, you will need to obtain the target and reference data. Here is how:

### Get Target Data
We will download an example Eigenstrat dataset available via the Reich lab homepage. 
It contains a handful of individuals from the Chalcolithic Levant (zipped, ~20 MB).
You can download it from:
`https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/Levant_ChL.tar.gz`
Download and unpack the files into the desired folder (unpacking will create .geno, .ind, and .snp files).
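
The download and unpacking can also be scripted. Below is a minimal sketch using only the Python standard library. The helper names `download` and `unpack` and the destination folder `./ExampleData/` are illustrative choices, not part of `hapROH`. After unpacking, check that the printed file names match the prefix you later pass as `path_targets`.

```python
import tarfile
import urllib.request
from pathlib import Path

URL = ("https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/"
       "files/inline-files/Levant_ChL.tar.gz")


def download(url, archive_path):
    """Download url to archive_path unless that file already exists."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def unpack(archive_path, dest):
    """Unpack a .tar.gz archive into dest and return the sorted member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
        return sorted(tar.getnames())


# Usage (needs network access), shown commented out:
# archive = download(URL, "./ExampleData/Levant_ChL.tar.gz")
# print(unpack(archive, "./ExampleData/"))
```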

### Get Reference Data
You can download the reference data from 
`https://www.dropbox.com/s/0qhjgo1npeih0bw/1000g1240khdf5.tar.gz?dl=0`

This data is key for running hapROH: it is the 1000 Genomes data downsampled to bi-allelic 1240k SNPs, comprising 5008 global reference haplotypes (Note: the file is ~800 MB in size). 

There is also a metafile datatable that contains information about the reference panel. This table is necessary to run hapROH; do not forget to copy it over as well, in addition to the data for each chromosome.
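
Before starting a run, it can help to confirm that the reference files are all in place. The sketch below is a hypothetical helper, not part of `hapROH`; it assumes the per-chromosome files are named `<prefix><chromosome>.hdf5`, which is the pattern visible in the log output of this notebook (`1240k_binary_chr20.hdf5`).

```python
from pathlib import Path


def missing_reference_files(h5_prefix, meta_path, chs=range(1, 23)):
    """Return the expected reference files that do not exist on disk.

    h5_prefix is the path up to the chromosome number; the suffix
    '<ch>.hdf5' is an assumption taken from the log output.
    """
    expected = [Path(f"{h5_prefix}{ch}.hdf5") for ch in chs]
    expected.append(Path(meta_path))
    return [str(p) for p in expected if not p.is_file()]


# Usage with the paths of this notebook, shown commented out:
# print(missing_reference_files(
#     "/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/1240k_binary_chr",
#     "/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/meta_df_all.csv"))
```

An empty list means every chromosome file and the metafile were found.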

# 2) Call ROH
Now we will run the core part of the package: `callROH_ind()`, a wrapper function for the 
core ROH calling machinery.

We first import it:

In [3]:
from hapROH.run import callROH_ind

## Test calling ROH on a single chromosome from an Individual, with full output printed
Set the location of the reference HDF5 and the target eigenstrat path correctly!

Input: Reference haplotype file (hdf5, containing genetic map), reference metafile, pseudo-haploid Eigenstrat
Output: Output file in folder `folder_out`

See the example below; you need to set `path_targets` and `h5_path1000g`.
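
A quick sanity check on `path_targets` can save a failed run. The sketch below uses hypothetical helper names that are not part of `hapROH`. It verifies that the three Eigenstrat files exist and that the individual is listed in the `.ind` file, relying on the Eigenstrat convention that the first whitespace-separated column of `.ind` is the individual ID.

```python
from pathlib import Path


def read_ind_ids(path_targets):
    """Return the individual IDs (first column) from <path_targets>.ind."""
    ids = []
    with open(f"{path_targets}.ind") as f:
        for line in f:
            fields = line.split()
            if fields:  # skip empty lines
                ids.append(fields[0])
    return ids


def check_target(path_targets, iid):
    """Raise if one of the three Eigenstrat files or the iid is missing."""
    for ext in (".geno", ".snp", ".ind"):
        if not Path(f"{path_targets}{ext}").is_file():
            raise FileNotFoundError(f"{path_targets}{ext}")
    if iid not in read_ind_ids(path_targets):
        raise ValueError(f"{iid} not found in {path_targets}.ind")


# Usage, shown commented out:
# check_target("./ExampleData/Levant_ChL", "I1178")
```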

`logfile=False` -> All output is printed to `sys.stdout`  
`combine=False` -> No individual .csv is created from the chromosome .csv files

### Minimal Version of `callROH_ind()`
A version in which the parameters that a normal user does not have to set are hidden (i.e., left at default values that work well for 1240K data).
hapROH is tested for a wide variety of human aDNA data, and the default parameters likely have you covered if you run hapROH on <50 ky old 1240K data.

This function will call ROH of individual `iid` in the target eigenstrat given in `path_targets`. We also set the path to the reference HDF5 file (`h5_path1000G`) and the metafile of the reference (`meta_path_ref`), as well as the folder in which to save the output (`folder_out`).

Let us run the data for chromosome 20 (`range(20, 21)` covers a single chromosome: `[20]`). We will use one processor (`processes=1`) and print all text output (`verbose=True`) to the console (`logfile=False`). 

In [4]:
callROH_ind(iid="I1178", chs=range(20, 21), 
          path_targets='./ExampleData/Levant_ChL', # The path before the .ind, .snp, .geno
          h5_path1000g="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/1240k_binary_chr", # The path up to the chr. number
          meta_path_ref="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/meta_df_all.csv", 
          folder_out='./ExampleOutput/',  # Folder where you want to save the results to 
          processes=1, verbose=True,
          readcounts=False, logfile=False, combine=False)

Doing Individual I1178...
Using Rescaled HMM.
Loaded Pre Processing Model: Eigenstrat
Loading Individual: I1178

Loaded 29078 variants
Loaded 2504 individuals
HDF5 loaded from /mnt/archgen/public_data/hapROH_refpanel/1240k_binary/1240k_binary_chr20.hdf5
Eigenstrat packed: True
3 Eigenstrat Files with 22 Individuals and 1233013 SNPs

Intersection on Positions: 29078
Nr of Matching Refs: 29078 / 29078
Ref/Alt Matching: 29053 / 29078
Flipped Ref/Alt Matching: 0
Together: 29053 / 29078
2504 / 2504 Individuals included in Reference
Extracting up to 2504 Individuals
Reduced to markers with data: 19815 / 29053
Fraction SNPs covered: 0.6820
Extraction of 5008 reference haplotypes at 19815 sites complete
Flipping Ref/Alt Alleles in target for 0 SNPs...
Successfully saved target individual data to: ./ExampleOutput/I1178/chr20/
Shuffling phase of target...
Successfully loaded Data from: ./ExampleOutput/I1178/chr20/
Loaded Emission Model: haploid
Loaded Transition Model: model
Loaded Post Processi

### Power version of `callROH_ind` [does the same as above, but shows more parameters that could be changed]
Shows most of the parameters that you can set. 

Warning: Try to understand what you are doing when changing parameters. For example, the transition parameters are already optimized for 1240k data; changing them can have unintended consequences.

All parameters are explained in the docstring of `callROH_ind()`.

In [5]:
callROH_ind(iid="I1178", chs=range(20, 21), 
          path_targets='./ExampleData/Levant_ChL', # The path before the .ind, .snp, .geno
          h5_path1000g="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/1240k_binary_chr", # The path up to the chr. number
          meta_path_ref="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/meta_df_all.csv", 
          folder_out='./ExampleOutput/', prefix_out='', 
          e_model='haploid', p_model='Eigenstrat', post_model='Standard', 
          low_mem=True, processes=1, delete=False, verbose=True, save=True, save_fp=False, 
          n_ref=2504, diploid_ref=True, exclude_pops=[], readcounts=False, random_allele=True,
          c=0.0, conPop=["CEU"], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.00, 
          cutoff_post = 0.999, max_gap=0.005, roh_min_l_initial = 0.02, roh_min_l_final = 0.04,
          min_len1 = 0.02, min_len2 = 0.04, logfile=False, combine=False, file_result="_roh_full.csv")

Doing Individual I1178...
Using Rescaled HMM.
Loaded Pre Processing Model: Eigenstrat
Loading Individual: I1178

Loaded 29078 variants
Loaded 2504 individuals
HDF5 loaded from /mnt/archgen/public_data/hapROH_refpanel/1240k_binary/1240k_binary_chr20.hdf5
Eigenstrat packed: True
3 Eigenstrat Files with 22 Individuals and 1233013 SNPs

Intersection on Positions: 29078
Nr of Matching Refs: 29078 / 29078
Ref/Alt Matching: 29053 / 29078
Flipped Ref/Alt Matching: 0
Together: 29053 / 29078
2504 / 2504 Individuals included in Reference
Extracting up to 2504 Individuals
Reduced to markers with data: 19815 / 29053
Fraction SNPs covered: 0.6820
Extraction of 5008 reference haplotypes at 19815 sites complete
Flipping Ref/Alt Alleles in target for 0 SNPs...
Successfully saved target individual data to: ./ExampleOutput/I1178/chr20/
Shuffling phase of target...
Successfully loaded Data from: ./ExampleOutput/I1178/chr20/
Loaded Emission Model: haploid
Loaded Transition Model: model
Loaded Post Processi

### Example run of whole individual (all chromosomes) with output to logfile
Now let us run a whole individual in parallel (for each chromosome), with the output sent to a log file.

Attention: `hapROH` requires a non-trivial amount of RAM. For an Eigenstrat, it can spike up to a few GB for a long chromosome with most SNPs covered. Allocate memory accordingly when running multiple processes, or set the number of parallel processes to a lower value.
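
One way to pick `processes` is to divide the memory you can spare by the peak memory of a single chromosome run. The sketch below is plain arithmetic with a hypothetical helper name; the per-process figure is something to measure on your own data, and the 4 GB in the usage line is only a placeholder.

```python
def n_processes(available_gb, per_process_gb, n_chromosomes=22):
    """Number of parallel processes that fit into the memory budget.

    Always returns at least 1 and never more than the number of chromosomes.
    """
    if per_process_gb <= 0:
        raise ValueError("per_process_gb must be positive")
    fit = int(available_gb // per_process_gb)
    return max(1, min(n_chromosomes, fit))


# Placeholder numbers: 24 GB available, 4 GB assumed peak per process
print(n_processes(24, 4))
```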

In [6]:
callROH_ind(iid="I1178", chs=range(1,23), processes=6, 
          path_targets='./ExampleData/Levant_ChL', # The path before the .ind, .snp, .geno
          h5_path1000g="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/1240k_binary_chr", # The path up to the chr. number
          meta_path_ref="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/meta_df_all.csv",
          folder_out='./ExampleOutput/', prefix_out='', 
          e_model="haploid", p_model="Eigenstrat",
          random_allele=True, readcounts=False,
          delete=False, logfile=True, combine=True)

Doing Individual I1178...
Set Output Log path: ./ExampleOutput/I1178/chr6/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr5/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr4/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr3/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr2/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr1/hmm_run_log.txt
Combining Information for 22 Chromosomes...


### Run multiple Individuals
Now, let's loop over individuals. In practice, I run this as a parallelized sbatch job with one process per individual (setting `processes=1` there), so there is no need to worry about simultaneous memory spikes. This is the recommended level of parallelization; the individual output files will be combined in a subsequent step.
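
A job-array setup along these lines could look as follows. This is a sketch under assumptions: it presumes a Slurm array job (for example submitted with `sbatch --array=0-8`), where Slurm sets `SLURM_ARRAY_TASK_ID` for each task. The helper names `iid_for_task` and `run_task` are hypothetical; the `callROH_ind` arguments are the ones used elsewhere in this notebook, with `processes=1`.

```python
import os

IIDS = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168',
        'I1166', 'I1170', 'I1165', 'I1182']


def iid_for_task(iids, task_id):
    """Map a 0-based array task index to one individual ID."""
    if not 0 <= task_id < len(iids):
        raise IndexError(f"task_id {task_id} outside 0..{len(iids) - 1}")
    return iids[task_id]


def run_task(task_id):
    """Call ROH for the individual that belongs to this array task."""
    from hapROH.run import callROH_ind  # imported lazily, only needed when running
    callROH_ind(iid=iid_for_task(IIDS, task_id), chs=range(1, 23), processes=1,
                path_targets='./ExampleData/Levant_ChL',
                h5_path1000g="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/1240k_binary_chr",
                meta_path_ref="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/meta_df_all.csv",
                folder_out='./ExampleOutput/', prefix_out='',
                e_model="haploid", p_model="Eigenstrat", n_ref=2504,
                random_allele=True, readcounts=False,
                delete=False, logfile=True, combine=True)


# In the script executed by each array task, shown commented out:
# run_task(int(os.environ["SLURM_ARRAY_TASK_ID"]))
```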

In [8]:
### These are all individuals with 400k SNPs covered
iids = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168', 'I1166', 'I1170', 'I1165', 'I1182']

for iid in iids:
    print(f"Doing Individual: {iid}")
    callROH_ind(iid=iid, chs=range(1,23), processes=6, 
              path_targets='./ExampleData/Levant_ChL', # The path before the .ind, .snp, .geno
              h5_path1000g="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/1240k_binary_chr", # The path up to the chr. number
              meta_path_ref="/mnt/archgen/public_data/hapROH_refpanel/1240k_binary/meta_df_all.csv",
              folder_out='./ExampleOutput/', prefix_out='', 
              e_model="haploid", p_model="Eigenstrat", n_ref=2504,
              random_allele=True, readcounts=False,
              delete=False, logfile=True, combine=True)

Doing Individual: I1178
Doing Individual I1178...
Set Output Log path: ./ExampleOutput/I1178/chr1/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr2/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr3/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr4/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr6/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I1178/chr5/hmm_run_log.txt
Combining Information for 22 Chromosomes...
Doing Individual: I0644
Doing Individual I0644...
Set Output Log path: ./ExampleOutput/I0644/chr1/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I0644/chr4/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I0644/chr3/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I0644/chr6/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I0644/chr2/hmm_run_log.txt
Set Output Log path: ./ExampleOutput/I0644/chr5/hmm_run_log.txt
Combining Information for 22 Chromosomes...
Doing Individual: I1160
Doing Individual I11

# 3) Postprocess Results into a single results.csv (optionally, copying in Meta Data)
Take individual output .csv files and combine them into one main results .csv file.

**Important**: Merging of ROH gaps, and filtering to ROH>x cM happens here. This is a key step when interpreting the output!

In [3]:
#from hapsburg.PackagesSupport.pp_individual_roh_csvs import pp_individual_roh
from hapROH.run import ppROH_inds

### Create Example Meta File
This cell creates a minimal metadata file (some plotting functions use individual meta information, such as age),
which is then merged into the results when combining individuals.

If you have metadata available for your dataset, you can prepare a comma-separated table with these headers!

In [4]:
iids = ['I1178', 'I0644', 'I1160', 
        'I1152', 'I1168', 'I1166', 
        'I1170', 'I1165', 'I1182']
df = pd.DataFrame({"iid":iids})

df["clst"] = "Israel_C"  # The clst column is necessary - it sets the population label column
df["age"] = 5950
df["lat"] = 32.974167
df["lon"] = 35.331389

df.to_csv("./ExampleData/meta_blank.csv", 
          sep=",", index=False)
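
If you bring your own meta table, a small check before post-processing can catch missing columns or individuals. The sketch below uses a hypothetical helper `check_meta` built on the standard `csv` module. Treating `iid` and `clst` as the required headers is an assumption based on the example table created above.

```python
import csv


def check_meta(meta_path, iids, required=("iid", "clst")):
    """Return (missing_columns, missing_iids) for a comma-separated meta file."""
    with open(meta_path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        found = {row.get("iid") for row in reader}
    missing_cols = [c for c in required if c not in header]
    missing_iids = [i for i in iids if i not in found]
    return missing_cols, missing_iids


# Usage, shown commented out:
# print(check_meta("./ExampleData/meta_blank.csv", ['I1178', 'I0644']))
```

Two empty lists mean that the required headers are present and every individual has a row.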

### Combine individual output files into a single ROH output File
Combines the meta file and the individual ROH output files
into a summary table, containing statistical information about ROH for each individual as a row.

This is the table that a lot of the downstream plotting software uses.

Creating this table also does additional post-processing of the ROH, in particular merging gaps between ROH. Two neighbouring ROH separated by a gap of length up to `gap` [in cM] are merged if both are longer than `min_len1` and at least one of them is longer than `min_len2`.

This is by default already done in `callROH_ind`, but it can also be done here. Doing it only at the end is useful, e.g., when you want to try different post-processing and gap-merging settings: you can set `max_gap=0` in `callROH_ind` and then try different values of `gap` here.
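
To make the gap-merging rule concrete, here is a toy re-implementation on plain `(start, end)` intervals in cM for a single chromosome. It illustrates the rule as described above and is not the code `hapROH` runs. Two details are assumptions: the gap comparison is strict, and the length of an already merged block is used when the next neighbour is checked.

```python
def merge_gaps(rohs, gap=0.5, min_len1=2.0, min_len2=4.0):
    """Merge neighbouring (start, end) blocks, all values in cM.

    Two neighbours are merged if the gap between them is below `gap`,
    both are longer than min_len1 and the longer one exceeds min_len2.
    """
    merged = []
    for start, end in sorted(rohs):
        if merged:
            prev_start, prev_end = merged[-1]
            len_prev, len_new = prev_end - prev_start, end - start
            short_gap = (start - prev_end) < gap
            both_long = min(len_prev, len_new) > min_len1
            one_longer = max(len_prev, len_new) > min_len2
            if short_gap and both_long and one_longer:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


print(merge_gaps([(10.0, 15.0), (15.25, 18.0), (30.0, 31.0)]))
# → [(10.0, 18.0), (30.0, 31.0)]
```

The first two blocks are 0.25 cM apart and long enough, so they are joined; the third block is too far away and stays separate. With `gap=0` nothing is merged.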

In [4]:
### Postprocess the individuals from above and combine into one results .csv
iids = ['I1178', 'I0644', 'I1160', 'I1152', 
        'I1168', 'I1166', 'I1170', 'I1165', 'I1182']

df1 = ppROH_inds(iids, meta_path="./ExampleData/meta_blank.csv",  
                 base_folder="./ExampleOutput/",
                 save_path="./ExampleOutput/combined_roh05.csv", 
                 verbose=False, min_cm=[4, 8, 12, 20], snp_cm=50, 
                 gap=0.5, min_len1=2.0, min_len2=4.0)

Loaded 9 / 9 Individuals from Meta
Saved to: ./ExampleOutput/combined_roh05.csv
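
The thresholds in `min_cm=[4, 8, 12, 20]` correspond to the `sum_roh>x` and `n_roh>x` columns of the summary table. The sketch below shows how such columns could be derived from a list of ROH lengths in cM. `summarize_roh` is a hypothetical helper for illustration, and the strict "longer than" comparison is an assumption based on the column names.

```python
def summarize_roh(lengths_cm, min_cm=(4, 8, 12, 20)):
    """Summarise ROH lengths (in cM) into per-threshold sums and counts."""
    stats = {"max_roh": max(lengths_cm, default=0.0)}
    for cutoff in min_cm:
        longer = [length for length in lengths_cm if length > cutoff]
        stats[f"sum_roh>{cutoff}"] = sum(longer)
        stats[f"n_roh>{cutoff}"] = len(longer)
    return stats


print(summarize_roh([5.0, 9.0, 25.0]))
```

For the three toy lengths, all three ROH count towards the 4 cM threshold, two towards 8 cM, and only the 25 cM block towards 12 and 20 cM.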


### [Alternative Post-processing]
You do not strictly need to create a meta table; you can instead provide only the list of iids.
Note that this will set the population column to a default value for all iids, which affects downstream plotting.
You can, of course, manually update this column.

In [5]:
iids = ['I1178', 'I0644', 'I1160', 'I1152', 
        'I1168', 'I1166', 'I1170', 'I1165', 'I1182']

df2 = ppROH_inds(iids, base_folder="./ExampleOutput/", meta_info=False,
                 save_path="", meta_path="", verbose=False, min_cm=[4, 8, 12, 20], snp_cm=50, 
                 gap=0.5, min_len1=2.0, min_len2=4.0)

### Show the output table

In [6]:
df1

| | max_roh | iid | pop | sum_roh>4 | n_roh>4 | sum_roh>8 | n_roh>8 | sum_roh>12 | n_roh>12 | sum_roh>20 | n_roh>20 | clst | age | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 91.121798 | I1178 | Israel_C | 703.292088 | 30 | 682.518689 | 26 | 625.156411 | 20 | 545.074702 | 15 | Israel_C | 5950 | 32.974167 | 35.331389 |
| 1 | 9.192801 | I1170 | Israel_C | 26.837993 | 4 | 18.287802 | 2 | 0.0 | 0 | 0.0 | 0 | Israel_C | 5950 | 32.974167 | 35.331389 |
| 2 | 4.8791 | I1166 | Israel_C | 9.593799 | 2 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | Israel_C | 5950 | 32.974167 | 35.331389 |
| 3 | 4.202509 | I1160 | Israel_C | 8.283711 | 2 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | Israel_C | 5950 | 32.974167 | 35.331389 |
| 4 | 0.0 | I0644 | Israel_C | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | Israel_C | 5950 | 32.974167 | 35.331389 |
| 5 | 0.0 | I1168 | Israel_C | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | Israel_C | 5950 | 32.974167 | 35.331389 |
| 6 | 0.0 | I1152 | Israel_C | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | Israel_C | 5950 | 32.974167 | 35.331389 |
| 7 | 0.0 | I1165 | Israel_C | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | Israel_C | 5950 | 32.974167 | 35.331389 |
| 8 | 0.0 | I1182 | Israel_C | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | Israel_C | 5950 | 32.974167 | 35.331389 |


One individual (I1178) has extremely high amounts of ROH! Well, maybe that's why I chose this dataset as the example :). 
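
Such outliers can also be picked out programmatically from the saved summary file. The sketch below uses only the standard `csv` module. The helper name, the 50 cM threshold, and the tab delimiter of the saved file are all assumptions; check the delimiter of your own output and adjust the threshold to your question.

```python
import csv


def high_roh_individuals(csv_path, column="sum_roh>20", threshold_cm=50.0,
                         delimiter="\t"):
    """Return (iid, value) pairs whose value in `column` exceeds threshold_cm.

    The delimiter default is an assumption; pass "," for comma-separated files.
    """
    hits = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            value = float(row[column])
            if value > threshold_cm:
                hits.append((row["iid"], value))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Usage, shown commented out:
# print(high_roh_individuals("./ExampleOutput/combined_roh05.csv"))
```

With the values shown in the table above and the placeholder threshold, only I1178 would be returned.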

For more utilities for interpreting and visualizing these ROH results, you can continue in `./plotROH_vignette.ipynb`.