# Vignette of the PyPI Package hapROH: Calling ROH from an Eigenstrat File
This notebook contains an example application of hapROH:
We will identify runs of homozygosity (ROH) in a target Eigenstrat file. Importantly, it can serve as a blueprint for your own ROH inference!

You will learn how to run hapROH on single chromosomes, on whole individuals, and on sets of individuals, and how to post-process multiple output files into one summary data file (including metadata).

If you want to learn how to visualize the results, have a look at the plotting vignette!

In [7]:
### Imports and path setup (the paths are from Harald's machine; fill in your own below)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os as os
import sys as sys
import multiprocessing as mp

print(f"CPU Count: {mp.cpu_count()}")

### Optional: use a local checkout instead of the pip-installed version.
### Remove the following line if you installed hapROH via pip.
sys.path.insert(0, "./package/")  # Put the local package first on the import path

### Fill in your own path here!
path = "/project2/jnovembre/hringbauer/HAPSBURG/"  # Path to the HAPSBURG folder on the Midway cluster
#path = "/home/harald/git/HAPSBURG/"   # The path on Harald's machine
os.chdir(path)  # Set the working directory to the HAPSBURG root folder
print(f"Set path to: {os.getcwd()}")  # Show the current working directory

CPU Count: 28
Set path to: /project2/jnovembre/hringbauer/HAPSBURG


# 2) Call ROH

In [8]:
from hapsburg.PackagesSupport.hapsburg_run import hapsb_ind  # Need this import

### Power version [shows more of the parameters that can be changed]
This call shows most of the parameters that you can set. Warning: Make sure you understand a parameter before you change it.
E.g. the transition parameters are already optimized for 1240k data; changing them can have... fun consequences.
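
Before playing with parameters, it can save time to confirm that the input paths resolve. The `path_targets` argument used in this notebook is the shared prefix of the three Eigenstrat files (`.ind`, `.snp`, `.geno`). The helper below is a minimal sketch of such a pre-flight check; `missing_eigenstrat_files` is a hypothetical name and not part of hapROH.

```python
import os

# Eigenstrat data come as three files that share one prefix
# (the "path before the .ind, .snp, .geno" passed as `path_targets`).
EIGENSTRAT_SUFFIXES = (".ind", ".snp", ".geno")


def missing_eigenstrat_files(prefix):
    """Return the expected Eigenstrat files at `prefix` that do not exist.

    Hypothetical helper for illustration; not part of hapROH.
    """
    paths = [prefix + suffix for suffix in EIGENSTRAT_SUFFIXES]
    return [p for p in paths if not os.path.isfile(p)]


# Usage with the prefix from this notebook (adjust to your own data)
prefix = "./Data/ReichLabEigenstrat/Raw.v42.4/v42.4.1240K"
missing = missing_eigenstrat_files(prefix)
if missing:
    print("Missing input files:", missing)
```

An empty list means all three files were found, so the run can start.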

In [9]:
hapsb_ind(iid="Villabruna", chs=range(17, 18), 
          path_targets='./Data/ReichLabEigenstrat/Raw.v42.4/v42.4.1240K', # The path before the .ind, .snp, .geno
          h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
          meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
          folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
          e_model='haploid', p_model='Eigenstrat', 
          post_model='Standard', processes=1, delete=False, output=True, save=True, 
          save_fp=False, n_ref=2504, exclude_pops=[], readcounts=False, random_allele=True, 
          roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, 
          cutoff_post=0.999, max_gap=0, roh_min_l=0.01, 
          logfile=False, combine=False, file_result='_roh_full.csv')

Doing Individual Villabruna...
Running 1 total jobs; 1 in parallel.
Running single process...
Using Rescaled HMM.
Loaded Pre Processing Model: Eigenstrat
Loading Individual: Villabruna

Loaded 29314 variants
Loaded 2504 individuals
HDF5 loaded from ./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr17.hdf5
Eigenstrat packed: True
3 Eigenstrat Files with 6676 Individuals and 1233013 SNPs

Intersection on Positions: 29314
Nr of Matching Refs: 29314 / 29314
Ref/Alt Matching: 29292 / 29314
Flipped Ref/Alt Matching: 0
Together: 29292 / 29314
2504 / 2504 Individuals included in Reference
Extracting up to 2504 Individuals
Reduced to markers with data: 23996 / 29292
Fraction SNPs covered: 0.8192
Exctraction of hdf5 done. Subsetting...!
Extraction of 5008 Haplotypes complete
Flipping Ref/Alt Alleles in target for 0 SNPs...
Successfully saved target individual data to: ./Empirical/Eigenstrat/Example/Villabruna/chr17/
Shuffling phase of target...
Successfully loaded Data from: ./Empirical/Eigenstra
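
Some numbers in the log above are worth a quick sanity check. The short sketch below reproduces two of them from values copied out of the log: the number of reference haplotypes (two per diploid reference individual) and the fraction of matched SNPs at which the target has data. The variable names are mine, not hapROH's.

```python
# Numbers copied from the log output above (Villabruna, chromosome 17)
n_ref_individuals = 2504     # "Loaded 2504 individuals"
n_markers_matched = 29292    # "Together: 29292 / 29314"
n_markers_with_data = 23996  # "Reduced to markers with data: 23996 / 29292"

# Each diploid reference individual contributes two haplotypes
n_haplotypes = 2 * n_ref_individuals

# Fraction of matched reference SNPs at which the target has data
frac_covered = n_markers_with_data / n_markers_matched

print(f"Haplotypes: {n_haplotypes}")                  # matches "Extraction of 5008 Haplotypes"
print(f"Fraction SNPs covered: {frac_covered:.4f}")   # matches "Fraction SNPs covered: 0.8192"
```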

### Example run of a whole individual (all chromosomes) with output to a logfile
This example runs a whole individual in parallel, with the output sent to a logfile.

Attention: hapROH uses quite a bit of RAM. For an Eigenstrat file, usage can spike up to 6 GB for a long chromosome with most SNPs covered. Allocate memory accordingly
when running multiple processes, or set the number of processes lower!
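
As a rough guide for choosing `processes`, one can divide the available memory by the peak usage per chromosome. The sketch below assumes the 6 GB peak quoted above as a per-job budget; `safe_process_count` is a hypothetical helper, and the real memory footprint depends on your data.

```python
import math
import multiprocessing as mp


def safe_process_count(mem_gb, n_jobs, per_job_gb=6.0, n_cpus=None):
    """Number of parallel jobs that fits into `mem_gb` of RAM.

    Hypothetical helper: `per_job_gb` defaults to the 6 GB peak
    mentioned in the text and is an assumption, not a guarantee.
    """
    if n_cpus is None:
        n_cpus = mp.cpu_count()
    by_memory = math.floor(mem_gb / per_job_gb)
    # Never exceed the CPUs or the number of jobs; always run at least one
    return max(1, min(by_memory, n_cpus, n_jobs))


# 22 autosomes on a 28-core node with 48 GB allocated
print(safe_process_count(mem_gb=48, n_jobs=22, n_cpus=28))  # prints 8
```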

In [10]:
hapsb_ind(iid="Villabruna", chs=range(1, 23), 
          path_targets='./Data/ReichLabEigenstrat/Raw.v42.4/v42.4.1240K', # The path before the .ind, .snp, .geno
          h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
          meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
          folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
          e_model='haploid', p_model='Eigenstrat', 
          post_model='Standard', processes=8, delete=False, output=True, save=True, 
          save_fp=False, n_ref=2504, exclude_pops=[], readcounts=False, random_allele=True, 
          roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, 
          cutoff_post=0.999, max_gap=0, roh_min_l=0.01, 
          logfile=True, combine=True, file_result='_roh_full.csv')

Doing Individual Villabruna...
Running 22 total jobs; 8 in parallel.
Starting Pool of multiple workers...
Set Output Log path: ./Empirical/Eigenstrat/Example/Villabruna/chr2/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/Villabruna/chr1/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/Villabruna/chr4/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/Villabruna/chr3/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/Villabruna/chr5/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/Villabruna/chr6/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/Villabruna/chr7/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/Villabruna/chr8/hmm_run_log.txt
Combining Information for 22 Chromosomes...
Run finished successfully!
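
With `logfile=True`, the per-chromosome output goes to files instead of the notebook. Judging from the paths printed above, each log sits at `<folder_out>/<iid>/chr<N>/hmm_run_log.txt`. The sketch below builds these paths and lists chromosomes whose log is missing, which is a quick way to spot jobs that never started. Both helpers are hypothetical and only assume the path layout seen in the output.

```python
import os


def log_paths(folder_out, iid, chs, log_name="hmm_run_log.txt"):
    """Map each chromosome to its expected log path.

    Assumes the layout seen in the output above:
    <folder_out>/<iid>/chr<N>/hmm_run_log.txt
    """
    return {ch: os.path.join(folder_out, iid, f"chr{ch}", log_name) for ch in chs}


def chromosomes_without_log(folder_out, iid, chs):
    """Return the chromosomes for which no log file exists yet."""
    paths = log_paths(folder_out, iid, chs)
    return [ch for ch, path in paths.items() if not os.path.isfile(path)]


# Usage with the values from the run above
pending = chromosomes_without_log("./Empirical/Eigenstrat/Example/",
                                  "Villabruna", range(1, 23))
print(f"Chromosomes without a log file: {pending}")
```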


### Run multiple Individuals
Here we loop over individuals. In practice, I run it as a parallelized sbatch job (setting processes to 1 there); this way you can parallelize with one processor per individual. That is a useful level of parallelization, as the individual output files are combined in a separate step below.

In [43]:
### These are all individuals with 400k SNPs covered
iids = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168', 'I1166', 'I1170', 'I1165', 'I1182']

for iid in iids:
    print(f"Doing Individual: {iid}")
    hapsb_ind(iid=iid, chs=range(1,23), processes=6, 
              path_targets='./Data/ExampleData/Levant_ChL', 
              h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
              meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
              folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
              e_model="haploid", p_model="EigenstratPacked", n_ref=2504,
              random_allele=True, readcounts=False,
              delete=False, logfile=True, combine=True)

Doing Individual: I1178
Doing Individual I1178...
Running 22 total jobs; 6 in parallel.
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr1/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr2/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr3/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr5/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr4/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr6/hmm_run_log.txt
Combining Information for 22 Chromosomes...
Run finished successfully!
Doing Individual: I0644
Doing Individual I0644...
Running 22 total jobs; 6 in parallel.
Set Output Log path: ./Empirical/Eigenstrat/Example/I0644/chr1/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I0644/chr2/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I0644/chr3/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstra
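
The sbatch setup mentioned above can be sketched as a Slurm job array in which every task handles one individual. This is a minimal sketch and not my actual batch script: `iid_for_task` and `run_task` are hypothetical helpers, and the only Slurm feature used is the `SLURM_ARRAY_TASK_ID` environment variable that Slurm sets for array jobs. The `hapsb_ind` arguments are copied from the loop above, with `processes=1`.

```python
import os

iids = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168',
        'I1166', 'I1170', 'I1165', 'I1182']


def iid_for_task(iids, env=None):
    """Pick the individual that belongs to the current array task."""
    env = os.environ if env is None else env
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))  # falls back to 0 outside Slurm
    return iids[task_id]


def run_task(iids):
    """Run hapROH for the single individual of this array task."""
    # Imported lazily so that the helper above also works without hapROH
    from hapsburg.PackagesSupport.hapsburg_run import hapsb_ind
    hapsb_ind(iid=iid_for_task(iids), chs=range(1, 23), processes=1,
              path_targets='./Data/ExampleData/Levant_ChL',
              h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr',
              meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv',
              folder_out='./Empirical/Eigenstrat/Example/', prefix_out='',
              e_model="haploid", p_model="EigenstratPacked", n_ref=2504,
              random_allele=True, readcounts=False,
              delete=False, logfile=True, combine=True)


# run_task(iids)  # Uncomment inside the batch script
```

Submitted with `sbatch --array=0-8`, this would start nine tasks, one per entry of `iids`.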

### Postprocess Results into one results.csv (copying in Meta Data)
Take the individual output .csv files and combine them into one big results .csv.
Merging of ROH across gaps and summarizing ROH longer than x cM happen here.
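
To give an idea of what gap merging and length thresholds do, here is a toy sketch on segments given as `(start, end)` in cM. It is not hapROH's implementation. The merge rule is my reading of the parameter names `gap`, `min_len1` and `min_len2` and should be treated as an assumption: two neighbouring ROH are merged if the gap between them is at most `max_gap`, both are at least `min_len1` long, and the longer one is at least `min_len2` long.

```python
def merge_roh(segments, max_gap=0.5, min_len1=2.0, min_len2=4.0):
    """Merge neighbouring (start_cm, end_cm) segments of one chromosome.

    Toy illustration under an assumed merge rule; not hapROH code.
    """
    merged = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end = merged[-1]
            len_prev, len_cur = prev_end - prev_start, end - start
            close = (start - prev_end) <= max_gap
            both_long = min(len_prev, len_cur) >= min_len1
            one_longer = max(len_prev, len_cur) >= min_len2
            if close and both_long and one_longer:
                merged[-1] = (prev_start, end)  # bridge the gap
                continue
        merged.append((start, end))
    return merged


def sum_roh(segments, min_cm):
    """Total length of all segments that are at least `min_cm` long."""
    return sum(end - start for start, end in segments if (end - start) >= min_cm)


# Two ROH separated by a 0.3 cM gap, plus one short ROH far away
segments = [(10.0, 13.0), (13.3, 18.0), (40.0, 41.0)]
merged = merge_roh(segments)
print(merged)  # [(10.0, 18.0), (40.0, 41.0)]
for cutoff in [4, 8, 12, 20]:
    print(f"sum ROH > {cutoff} cM: {sum_roh(merged, cutoff)}")
```

In this toy example the merged 8 cM block counts towards the 4 cM and 8 cM thresholds, while neither of the two original blocks would have reached 8 cM on its own.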

In [44]:
from hapsburg.PackagesSupport.pp_individual_roh_csvs import pp_individual_roh

### Create Example Meta File
This cell creates a minimal metadata file (some plotting functions use individual meta information, such as age),
which is then merged into the results when combining individuals.

If you have metadata available for your dataset, you can prepare a comma-separated .csv table with these headers!

In [52]:
iids = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168', 'I1166', 'I1170', 'I1165', 'I1182']
df = pd.DataFrame({"iid":iids})

df["age"] = 5950
df["clst"] = "Israel_C"
df["lat"] = 32.974167
df["lon"] = 35.331389

df.to_csv("./Data/ExampleData/meta_blank.csv", 
          sep=",", index=False)
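
If you bring your own metadata table, a quick check of the headers and individuals can catch problems before the combining step. The sketch below uses only the standard library. `check_meta` is a hypothetical helper, and the required columns are simply the five used in the cell above; whether hapROH needs every one of them is an assumption.

```python
import csv
import io

# The five columns written by the cell above
REQUIRED_COLUMNS = ("iid", "age", "clst", "lat", "lon")


def check_meta(handle, iids, required=REQUIRED_COLUMNS):
    """Return (missing columns, missing individuals) for a meta .csv.

    `handle` is an open text file or any file-like object.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing_columns = [c for c in required if c not in header]
    found = {row.get("iid") for row in reader}
    missing_iids = [iid for iid in iids if iid not in found]
    return missing_columns, missing_iids


# In-memory example; for a real file use open("./Data/ExampleData/meta_blank.csv")
example = io.StringIO(
    "iid,age,clst,lat,lon\n"
    "I1178,5950,Israel_C,32.974167,35.331389\n"
    "I0644,5950,Israel_C,32.974167,35.331389\n"
)
print(check_meta(example, ["I1178", "I0644", "I1160"]))  # ([], ['I1160'])
```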

### Combine individual output files into a single ROH output File
Combines the meta file and the individual ROH output files
into a summary table, containing statistical information about ROH for each individual.

This is the table that a lot of the combined plotting software uses.

In [56]:
%%time
### Postprocess the individuals from above and combine into one results .csv
iids = ['I1178', 'I0644', 'I1160', 'I1152', 
        'I1168', 'I1166', 'I1170', 'I1165', 'I1182']

df1 = pp_individual_roh(iids, meta_path="./Data/ExampleData/meta_blank.csv", 
                        base_folder="./Empirical/Eigenstrat/Example/",
                        save_path="./Empirical/Eigenstrat/Example/combined_roh05.csv", 
                        output=False, min_cm=[4, 8, 12, 20], snp_cm=50, 
                        gap=0.5, min_len1=2.0, min_len2=4.0)

Loaded 9 / 9 Individuals from Meta
Saved to: ./Empirical/Eigenstrat/Example/combined_roh05.csv
CPU times: user 4.4 s, sys: 0 ns, total: 4.4 s
Wall time: 4.39 s


# Area 51