# Vignette of the PyPI Package hapROH: Calling ROH from an Eigenstrat File
This notebook contains an example application of hapROH:
We will identify ROH in a target Eigenstrat file. Importantly, it can serve as a blueprint for your own ROH inference!

You will learn how to run hapROH on single chromosomes, on whole individuals, and on sets of individuals, and how to post-process multiple output files into one summary data file (including metadata).

If you want to learn how to visualize the results, take a look at the plotting vignette.

In [1]:
### Some Code to set right paths on Harald's Machine
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os as os
import sys as sys
import multiprocessing as mp

print(f"CPU Count: {mp.cpu_count()}")

### To use a local version instead of the pip-installed version, uncomment:
#sys.path.insert(0,"./package/")  # hack to get local package first in path

CPU Count: 28


# Set the Path
You can set the path here to the folder you want to work in (relative data paths will be resolved relative to this folder). Alternatively, you can give absolute paths in the next steps.

In [2]:
### Fill in your own path here!
path = "/project2/jnovembre/hringbauer/HAPSBURG/"  # The Path to Package Midway Cluster
#path = "/home/harald/git/HAPSBURG/"   # The Path on Harald's machine

os.chdir(path)  # Set the right Path (in line with Atom default)
print(f"Set path to: {os.getcwd()}") # Show the current working directory

Set path to: /project2/jnovembre/hringbauer/HAPSBURG


# 1) Download the data we will use as an example here
If you don't have it already, you will need to get the target and reference data. Here is how:

### Get Target Data
We will download an example Eigenstrat dataset available via the Reich lab homepage. 
It contains a handful of individuals from the Chalcolithic Levant (zipped, it is ~20 MB).
You can download it from:
`https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/Levant_ChL.tar.gz`
Download and unpack it in the folder where you need it (it unpacks to a .geno, .ind and .snp file).
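
If you prefer to do this step from Python, the following is a minimal sketch using only the standard library. The helper names `download` and `unpack` and the local folder in the usage comment are illustrative assumptions, not part of hapROH. If the archive stores its files in a subfolder, adjust `path_targets` in the later steps accordingly.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL as given in the text above
TARGET_URL = ("https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/"
              "files/inline-files/Levant_ChL.tar.gz")


def download(url, tar_path):
    """Download url to tar_path unless that file already exists."""
    tar_path = Path(tar_path)
    if not tar_path.exists():
        tar_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(tar_path))
    return tar_path


def unpack(tar_path, dest_folder):
    """Unpack a .tar.gz archive and return the files found in dest_folder."""
    dest_folder = Path(dest_folder)
    dest_folder.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(tar_path), "r:gz") as tar:
        tar.extractall(str(dest_folder))
    return sorted(p for p in dest_folder.rglob("*") if p.is_file())


# Example usage (hypothetical local folder, matching the paths used later):
# tar = download(TARGET_URL, "./Data/ExampleData/Levant_ChL.tar.gz")
# print(unpack(tar, "./Data/ExampleData/"))
```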

### Get Reference Data
You can download the reference data from 
`https://www.dropbox.com/s/0qhjgo1npeih0bw/1000g1240khdf5.tar.gz?dl=0`

This data is key for running hapROH: it is the 1000 Genomes data downsampled to bi-allelic 1240k SNPs, from 5008 global reference haplotypes (Attention: ca. 800 MB). 

There is also a metafile data table that contains information about the reference panel. Do not forget to copy that over too, in addition to the data for every chromosome.
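
Before starting a run, it can help to check that the reference files are where you expect them. Below is a small sketch: the helper `missing_reference_files` is hypothetical, and the naming pattern `<prefix><chromosome>.hdf5` is inferred from the log output of the runs in this notebook (e.g. `.../chr20.hdf5`). The two paths are the ones used in the cells of this vignette, so replace them with your own.

```python
import os


def missing_reference_files(h5_prefix, meta_path, chs=range(1, 23)):
    """Return the expected reference files that do not exist on disk."""
    expected = [f"{h5_prefix}{ch}.hdf5" for ch in chs] + [meta_path]
    return [p for p in expected if not os.path.isfile(p)]


missing = missing_reference_files(
    h5_prefix="./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr",
    meta_path="./Data/1000Genomes/Individuals/meta_df_all.csv")

print(f"{len(missing)} reference file(s) missing")
for p in missing:
    print("  ", p)
```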

# 2) Call ROH
Now we will run the core part of the package, a wrapper function for the 
core ROH calling machinery - `hapsb_ind`

In [3]:
from hapsburg.PackagesSupport.hapsburg_run import hapsb_ind  # Need this import

### Test calling ROH on single chromosome from Individual with full output printed
Set the location of the reference hdf5 and the target eigenstrat path correctly!

Input: Reference haplotype file (hdf5, containing genetic map), reference metafile, pseudo-haploid Eigenstrat
Output: Output file in folder `folder_out`, importantly a 

See example below, you need to set `path_targets` and `h5_path1000g`

`logfile=False` -> All output is printed to sys.stdout  
combine=False -> No individual .csv is created from chromosome .csvs 

### Minimal Version
A version where the complex parameters that a normal user does not have to set are hidden (i.e. set to default values that work well for 1240K data).
hapROH is tested for a wide variety of human aDNA data, and the default parameters likely have you covered if you run hapROH on <50 ky old 1240K data.

This function will call ROH of individual `iid` in the target Eigenstrat given in `path_targets`. We also set the path to the reference hdf5 file (`h5_path1000g`) and the metafile of the reference (`meta_path_ref`), and, importantly, the folder to save the output to (`folder_out`).

Let us run the data for chromosome 20 (`range(20, 21)` selects a range of length 1, containing only 20), using 1 processor (`processes=1`) and printing all text output (`output=True`) to the console (`logfile=False`). 

In [4]:
hapsb_ind(iid="I1178", chs=range(20, 21), 
          path_targets='./Data/ExampleData/Levant_ChL', # The path before the .ind, .snp, .geno
          h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
          meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
          folder_out='./Empirical/Eigenstrat/Example/',  # Folder where you want to save the results to 
          processes=1, output=True,
          readcounts=False, logfile=False, combine=False)

Doing Individual I1178...
Using Rescaled HMM.
Loaded Pre Processing Model: Eigenstrat
Loading Individual: I1178

Loaded 29078 variants
Loaded 2504 individuals
HDF5 loaded from ./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr20.hdf5
Eigenstrat packed: True
3 Eigenstrat Files with 22 Individuals and 1233013 SNPs

Intersection on Positions: 29078
Nr of Matching Refs: 29078 / 29078
Ref/Alt Matching: 29053 / 29078
Flipped Ref/Alt Matching: 0
Together: 29053 / 29078
2504 / 2504 Individuals included in Reference
Extracting up to 2504 Individuals
Reduced to markers with data: 19815 / 29053
Fraction SNPs covered: 0.6820
Exctraction of hdf5 done. Subsetting...!
Extraction of 5008 Haplotypes complete
Flipping Ref/Alt Alleles in target for 0 SNPs...
Successfully saved target individual data to: ./Empirical/Eigenstrat/Example/I1178/chr20/
Shuffling phase of target...
Successfully loaded Data from: ./Empirical/Eigenstrat/Example/I1178/chr20/
Loaded Emission Model: haploid
Loaded Transition Model: m

### Power version [does same as above, but shows more parameters that could be changed]
Shows most of the parameters that you can set. Warning: Try to understand what you are doing.
E.g. the transition parameters are optimized for 1240k data already - changing them can have ...fun consequences.

The parameters are explained in the docstring of the function.

In [8]:
hapsb_ind(iid="I1178", chs=range(20, 21), 
          path_targets='./Data/ExampleData/Levant_ChL', # The path before the .ind, .snp, .geno
          h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
          meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
          folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
          e_model='haploid', p_model='Eigenstrat', 
          post_model='Standard', processes=1, delete=False, output=True, save=True, save_fp=False, 
          n_ref=2504, diploid_ref=True, exclude_pops=[], readcounts=False, random_allele=True,
          c=0.0, conPop=["CEU"], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.00, 
          cutoff_post = 0.999, max_gap=0.005, roh_min_l_initial = 0.02, roh_min_l_final = 0.05,
          min_len1 = 0.02, min_len2 = 0.04, logfile=False, combine=False, file_result="_roh_full.csv")

Doing Individual I1178...
Using Rescaled HMM.
Loaded Pre Processing Model: Eigenstrat
Loading Individual: I1178

Loaded 29078 variants
Loaded 2504 individuals
HDF5 loaded from ./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr20.hdf5
Eigenstrat packed: True
3 Eigenstrat Files with 22 Individuals and 1233013 SNPs

Intersection on Positions: 29078
Nr of Matching Refs: 29078 / 29078
Ref/Alt Matching: 29053 / 29078
Flipped Ref/Alt Matching: 0
Together: 29053 / 29078
2504 / 2504 Individuals included in Reference
Extracting up to 2504 Individuals
Reduced to markers with data: 19815 / 29053
Fraction SNPs covered: 0.6820
Exctraction of hdf5 done. Subsetting...!
Extraction of 5008 Haplotypes complete
Flipping Ref/Alt Alleles in target for 0 SNPs...
Successfully saved target individual data to: ./Empirical/Eigenstrat/Example/I1178/chr20/
Shuffling phase of target...
Successfully loaded Data from: ./Empirical/Eigenstrat/Example/I1178/chr20/
Loaded Emission Model: haploid
Loaded Transition Model: m

### Example run of whole individual (all chromosomes) with output to logfile
This example runs a whole individual in parallel, with the output sent to a logfile.

Attention: hapROH uses quite a bit of RAM. For an Eigenstrat it can spike up to 6 GB for a long chromosome with most SNPs covered, so allocate memory accordingly
when running multiple processes, or set the number of processes lower!
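
As a rough guide for choosing `processes`, the arithmetic can be sketched as follows. The helper `choose_processes` is hypothetical, and the 6 GB per process is the worst-case figure quoted above, not a measured value for your data.

```python
import multiprocessing as mp


def choose_processes(available_gb, n_jobs=22, per_process_gb=6.0, n_cpu=None):
    """Largest number of parallel processes that fits memory, CPUs and jobs."""
    if n_cpu is None:
        n_cpu = mp.cpu_count()
    by_memory = int(available_gb // per_process_gb)  # processes that fit into RAM
    return max(1, min(by_memory, n_cpu, n_jobs))


# With 40 GB of RAM and 28 CPUs, memory is the limiting factor
print(choose_processes(available_gb=40, n_cpu=28))  # 6
```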

In [7]:
hapsb_ind(iid="I1178", chs=range(1,23), processes=6, 
          path_targets='./Data/ExampleData/Levant_ChL', 
          h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
          meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
          folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
          e_model="haploid", p_model="Eigenstrat",
          random_allele=True, readcounts=False,
          delete=False, logfile=True, combine=True)

Doing Individual I1178...
Running 22 total jobs; 6 in parallel.
Starting Pool of multiple workers...
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr1/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr2/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr3/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr4/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr5/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr6/hmm_run_log.txt
Combining Information for 22 Chromosomes...
Run finished successfully!


### Run multiple Individuals
Here as a loop over individuals. In practice I run it as a parallelized sbatch job (setting `processes` to 1 there), which parallelizes the run with 1 processor per individual. That is a useful level of parallelization, as the individual output files will be combined in a separate step below.
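
One way to set up such a job is a Slurm job array, where each task handles one individual. The sketch below assumes the script is submitted with something like `sbatch --array=0-8`, so that Slurm sets the environment variable `SLURM_ARRAY_TASK_ID` for each task. The helper names `iid_for_task` and `run_individual` are hypothetical, and the paths are the ones used in this vignette.

```python
import os

IIDS = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168',
        'I1166', 'I1170', 'I1165', 'I1182']


def iid_for_task(iids, environ=os.environ):
    """Pick one individual based on the (0-based) Slurm array task index."""
    task_id = int(environ["SLURM_ARRAY_TASK_ID"])
    if not 0 <= task_id < len(iids):
        raise IndexError(f"Task id {task_id} outside 0..{len(iids) - 1}")
    return iids[task_id]


def run_individual(iid):
    """Run hapROH for one individual on a single processor."""
    # Imported here so the helper above also works without hapROH installed
    from hapsburg.PackagesSupport.hapsburg_run import hapsb_ind
    hapsb_ind(iid=iid, chs=range(1, 23), processes=1,
              path_targets='./Data/ExampleData/Levant_ChL',
              h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr',
              meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv',
              folder_out='./Empirical/Eigenstrat/Example/',
              logfile=True, combine=True)


# In the job script, each array task would then call:
# run_individual(iid_for_task(IIDS))
```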

In [8]:
### These are all individuals with 400k SNPs covered
iids = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168', 'I1166', 'I1170', 'I1165', 'I1182']

for iid in iids:
    print(f"Doing Individual: {iid}")
    hapsb_ind(iid=iid, chs=range(1,23), processes=6, 
              path_targets='./Data/ExampleData/Levant_ChL', 
              h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
              meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
              folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
              e_model="haploid", p_model="EigenstratPacked", n_ref=2504,
              random_allele=True, readcounts=False,
              delete=False, logfile=True, combine=True)

Doing Individual: I1178
Doing Individual I1178...
Running 22 total jobs; 6 in parallel.
Starting Pool of multiple workers...
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr5/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr4/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr6/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr1/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr3/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr2/hmm_run_log.txt
Combining Information for 22 Chromosomes...
Run finished successfully!
Doing Individual: I0644
Doing Individual I0644...
Running 22 total jobs; 6 in parallel.
Starting Pool of multiple workers...
Set Output Log path: ./Empirical/Eigenstrat/Example/I0644/chr1/hmm_run_log.txtSet Output Log path: ./Empirical/Eigenstrat/Example/I0644/chr2/hmm_run_log.txt

Set Output Log path: ./Empirical/Eigenstrat/Exa

### Postprocess Results into one results.csv (copying in Meta Data)
Take the individual output .csvs and combine them into one big results .csv.
Merging of output gaps, and ROH>x cM happens here

In [9]:
from hapsburg.PackagesSupport.pp_individual_roh_csvs import pp_individual_roh

### Create Example Meta File
This cell creates a minimal metadata file (some plotting functions use individual meta information, such as age),
which is then merged into the results when combining individuals.

If you have metadata available for your dataset, you can prepare a comma-separated .csv table with these headers!

In [None]:
iids = ['I1178', 'I0644', 'I1160', 
        'I1152', 'I1168', 'I1166', 
        'I1170', 'I1165', 'I1182']
df = pd.DataFrame({"iid":iids})

df["age"] = 5950
df["clst"] = "Israel_C"
df["lat"] = 32.974167
df["lon"] = 35.331389

df.to_csv("./Data/ExampleData/meta_blank.csv", 
          sep=",", index=False)
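
If you bring your own metadata table, a quick sanity check can catch missing headers or individuals before the combining step. This is a sketch using only the standard library. The helper `check_meta_csv` is hypothetical, and the list of columns simply mirrors the ones created in the cell above. Whether hapROH strictly needs every one of them is not covered here.

```python
import csv

# Column names taken from the example metadata cell above
REQUIRED_COLUMNS = ["iid", "age", "clst", "lat", "lon"]


def check_meta_csv(path, iids, required=REQUIRED_COLUMNS):
    """Return (missing_columns, missing_iids) for a comma-separated meta file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        found_iids = {row["iid"] for row in reader} if "iid" in header else set()
    missing_cols = [c for c in required if c not in header]
    missing_iids = [i for i in iids if i not in found_iids]
    return missing_cols, missing_iids


# Example usage with the file written above:
# print(check_meta_csv("./Data/ExampleData/meta_blank.csv", ['I1178', 'I0644']))
```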

### Combine individual output files into a single ROH output File
Combines the meta file and the individual ROH output files
into a summary table, containing statistical information on ROH for each individual.

This is the table that a lot of the combined plotting software uses.

This also does additional post-processing of the ROH, in particular merging gaps between ROH.
Every gap of length up to `gap` [in cM] between two ROH of >`min_len1` and >`min_len2` is merged.

This can already be done in `hapsb_ind`, but also here. That is useful e.g. in cases where one wants to try different post-processing and gap-merging settings:
One could set the gap to 0 in `hapsb_ind` and then try different values here.
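
To make the gap-merging rule concrete, here is a toy sketch on ROH given as `(start, end)` positions in cM on one chromosome. It is an illustration under stated assumptions, not hapROH's implementation: it assumes that `min_len1` applies to the shorter and `min_len2` to the longer of the two neighbouring ROH, and it compares against the already merged block when chaining several merges.

```python
def merge_roh_gaps(blocks, gap=0.5, min_len1=2.0, min_len2=4.0):
    """Merge neighbouring (start, end) blocks [in cM] on one chromosome."""
    merged = []
    for start, end in sorted(blocks):
        if merged:
            prev_start, prev_end = merged[-1]
            # Lengths of the two neighbouring blocks, shorter one first
            short, long_ = sorted([prev_end - prev_start, end - start])
            if (start - prev_end) <= gap and short > min_len1 and long_ > min_len2:
                merged[-1] = (prev_start, end)  # close the gap
                continue
        merged.append((start, end))
    return merged


blocks = [(0.0, 5.0), (5.3, 8.0), (20.0, 21.0), (21.2, 30.0)]
print(merge_roh_gaps(blocks))
# [(0.0, 8.0), (20.0, 21.0), (21.2, 30.0)]
```

The first two blocks are merged across their 0.3 cM gap. The last two are not, because the 1 cM block is shorter than `min_len1`.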



In [11]:
%%time
### Postprocess the individuals from above and combine into one results .csv
iids = ['I1178', 'I0644', 'I1160', 'I1152', 
        'I1168', 'I1166', 'I1170', 'I1165', 'I1182']

df1 = pp_individual_roh(iids, meta_path="./Data/ExampleData/meta_blank.csv", 
                        base_folder="./Empirical/Eigenstrat/Example/",
                        save_path="./Empirical/Eigenstrat/Example/combined_roh05.csv", 
                        output=False, min_cm=[4, 8, 12, 20], snp_cm=50, 
                        gap=0.5, min_len1=2.0, min_len2=4.0)

Loaded 9 / 9 Individuals from Meta
Saved to: ./Empirical/Eigenstrat/Example/combined_roh05.csv
CPU times: user 4.36 s, sys: 11.8 ms, total: 4.37 s
Wall time: 4.37 s
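
To give an idea of what per-individual summary statistics for length cutoffs such as `min_cm=[4, 8, 12, 20]` can look like, here is a toy sketch on a plain list of ROH lengths in cM. The helper `summarize_roh`, the key names, and the strict `>` comparison are illustrative assumptions, so check the columns of your own `combined_roh05.csv` for the actual layout.

```python
def summarize_roh(lengths_cm, min_cm=(4, 8, 12, 20)):
    """Longest ROH plus total length and count of ROH above each cutoff."""
    stats = {"max_roh": max(lengths_cm, default=0.0)}
    for cutoff in min_cm:
        kept = [length for length in lengths_cm if length > cutoff]
        stats[f"sum_roh>{cutoff}"] = sum(kept)  # total length above cutoff [cM]
        stats[f"n_roh>{cutoff}"] = len(kept)    # number of ROH above cutoff
    return stats


# Toy individual with four ROH (lengths in cM)
for key, value in summarize_roh([4.5, 9.0, 13.5, 25.0]).items():
    print(key, value)
```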
