# Vignette of the PyPI Package hapROH: Calling ROH from an Eigenstrat File
This notebook contains an example application of hapROH:
We will identify runs of homozygosity (ROH) in a target Eigenstrat file. Importantly, it can serve as a blueprint for your own ROH inference!

You will learn how to run hapROH on single chromosomes, on whole individuals, and on sets of individuals, and how to post-process multiple output files into
one summary data file (including metadata).

If you want to learn how to visualize the results, have a look at the plotting vignette!

In [1]:
### Some Code to set right paths on Harald's Machine
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os as os
import sys as sys
import multiprocessing as mp

print(f"CPU Count: {mp.cpu_count()}")

### To use a local version instead of the pip-installed version,
### put the local package folder first in the module search path
sys.path.insert(0, "./package/")

CPU Count: 28


# Set the Path
If needed, set the path you want to work in here (relative data loads will be done from there).

In [2]:
### Fill in your own path here!
path = "/project2/jnovembre/hringbauer/HAPSBURG/"  # The Path to Package Midway Cluster
#path = "/home/harald/git/HAPSBURG/"   # The Path on Harald's machine

os.chdir(path)  # Set the working directory (in line with Atom default)
print(f"Set path to: {os.getcwd()}") # Show the current working directory. Should be the folder set in `path` above

Set path to: /project2/jnovembre/hringbauer/HAPSBURG


# 1) Get the data we will use as an example here
If you don't have it, you will need to get the target and reference data. Here is how:

### Get Target Data
We will download an example Eigenstrat dataset made available by the Reich lab.
It contains a handful of individuals from the Chalcolithic Levant (about 20 MB zipped).
You can download it from:
`https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/Levant_ChL.tar.gz`
Download and unpack it in the folder where you need it (it unpacks to a .geno, .ind and .snp file).
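
The download and unpack step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `download` and `unpack` and the target folder in the usage comment are assumptions for illustration, not part of hapROH. After unpacking, check that the .geno, .ind and .snp files sit where `path_targets` will point.

```python
import tarfile
import urllib.request
from pathlib import Path

URL = ("https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/"
       "files/inline-files/Levant_ChL.tar.gz")

def download(url, dest):
    """Download url to dest, skipping the download if dest already exists."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest

def unpack(archive, folder):
    """Unpack a .tar.gz archive into folder and return the sorted member names."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(tar.getnames())
        tar.extractall(folder)
    return names

# Usage (hypothetical folder, chosen to match the paths used later in this notebook):
# archive = download(URL, "./Data/ExampleData/Levant_ChL.tar.gz")
# print(unpack(archive, "./Data/ExampleData/"))
```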

### Get Reference Data
You can download the reference data from
`https://www.dropbox.com/s/0qhjgo1npeih0bw/1000g1240khdf5.tar.gz?dl=0`

This data is key for running hapROH: it is the 1000 Genomes data downsampled to bi-allelic 1240k SNPs, with 5008 global reference haplotypes. (Attention: the download is about 800 MB.)
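
Before starting a long run it can help to confirm that the unpacked reference files are where you expect them. This is a small sketch, assuming the per-chromosome naming `<prefix><chromosome>.hdf5` that the run logs in this notebook print (e.g. `.../chr20.hdf5`); the helper `missing_reference_files` is hypothetical and not part of hapROH.

```python
from pathlib import Path

def missing_reference_files(h5_prefix, chs):
    """Return the expected per-chromosome HDF5 paths that do not exist.

    Assumes files are named <h5_prefix><ch>.hdf5 (an assumption based on
    the paths printed in the run logs of this notebook).
    """
    expected = [f"{h5_prefix}{ch}.hdf5" for ch in chs]
    return [p for p in expected if not Path(p).is_file()]

# Usage with the prefix passed as h5_path1000g in this notebook:
# print(missing_reference_files(
#     "./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr", range(1, 23)))
```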

# 2) Call ROH
Now we will run the core part of the package: `hapsb_ind`, a wrapper function for the
core ROH calling machinery.

In [4]:
from hapsburg.PackagesSupport.hapsburg_run import hapsb_ind  # Need this import

### Test calling ROH on a single chromosome of an individual, with full output printed
Set the location of the reference HDF5 and the target Eigenstrat path correctly!

Input: reference haplotype file (HDF5, containing the genetic map), reference metafile, pseudo-haploid Eigenstrat  
Output: result files in the folder `folder_out` (in the example below, in the subfolder `I1178/chr20/`)

See the example below; you need to set `path_targets` and `h5_path1000g`.

`logfile=False` -> all output is printed to standard output  
`combine=False` -> no individual .csv is created from the chromosome .csvs 
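
A quick check of the target input can save a failed run. The sketch below assumes the standard Eigenstrat layout, in which `path_targets` is the common prefix of a .geno, .snp and .ind file and the first whitespace-separated column of the .ind file is the individual ID. The helper `eigenstrat_iids` is hypothetical and not part of hapROH.

```python
from pathlib import Path

def eigenstrat_iids(path_targets):
    """Check that the .geno/.snp/.ind triple exists and return the IDs in .ind."""
    base = str(path_targets)
    for ext in (".geno", ".snp", ".ind"):
        if not Path(base + ext).is_file():
            raise FileNotFoundError(f"Missing Eigenstrat file: {base + ext}")
    with open(base + ".ind") as f:
        # First column of each non-empty line is the individual ID
        return [line.split()[0] for line in f if line.strip()]

# Usage with the prefix used in this notebook:
# iids_available = eigenstrat_iids("./Data/ExampleData/Levant_ChL")
# assert "I1178" in iids_available
```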

### Minimal Version
A version where the complex parameters that a typical user will not need are hidden (i.e. left at their defaults).
hapROH is tested on a wide variety of human aDNA data, and the default parameters likely have you covered.

In [7]:
hapsb_ind(iid="I1178", chs=range(20, 21), 
          path_targets='./Data/ExampleData/Levant_ChL', # The path before the .ind, .snp, .geno
          h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
          meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
          folder_out='./Empirical/Eigenstrat/Example/',  # Folder where you want to save the results to 
          processes=1, output=True,
          readcounts=False, logfile=False, combine=False)

Doing Individual I1178...
Running 1 total jobs; 1 in parallel.
Using Low-Mem Cython Linear Speed Up.
Loaded Pre Processing Model: EigenstratPacked
Loading Individual: I1178

Loaded 29078 variants
Loaded 2504 individuals
HDF5 loaded from ./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr20.hdf5
3 Eigenstrat Files with 22 Individuals and 1233013 SNPs

Intersection on Positions: 29078
Nr of Matching Refs: 29078 / 29078
Ref/Alt Matching: 29053 / 29078
Flipped Ref/Alt Matching: 0
Together: 29053 / 29078
2504 / 2504 Individuals included in Reference
Extracting up to 2504 Individuals
Extraction of 5008 Haplotypes complete
Flipping Ref/Alt in target for 0 SNPs...
Reduced to markers called 19815 / 29053
Fraction SNPs covered: 0.6820
Successfully saved to: ./Empirical/Eigenstrat/Example/I1178/chr20/
Shuffling phase of target...
Successfully loaded Data from: ./Empirical/Eigenstrat/Example/I1178/chr20/
Loaded Emission Model: haploid
Loaded Transition Model: model
Loaded Post Processing Model: Stan

### Power version [does the same as above, but shows more settable parameters]
There are many parameters you can set. Warning: make sure you understand what you are changing.
E.g. the transition parameters are already optimized for 1240k data.

In [5]:
hapsb_ind(iid="I1178", chs=range(20, 21), 
          path_targets='./Data/ExampleData/Levant_ChL', # The path before the .ind, .snp, .geno
          h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
          meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
          folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
          e_model='haploid', p_model='Eigenstrat', 
          post_model='Standard', processes=1, delete=False, output=True, save=True, 
          save_fp=False, n_ref=2504, exclude_pops=[], readcounts=True, random_allele=True, 
          roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, 
          cutoff_post=0.999, max_gap=0, roh_min_l=0.01, 
          logfile=False, combine=False, file_result='_roh_full.csv')

Doing Individual I1178...
Running 1 total jobs; 1 in parallel.
Using Low-Mem Cython Linear Speed Up.
Loaded Pre Processing Model: Eigenstrat
Loading Individual: I1178

Loaded 29078 variants
Loaded 2504 individuals
HDF5 loaded from ./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr20.hdf5
Eigenstrat packed: True
3 Eigenstrat Files with 22 Individuals and 1233013 SNPs

Intersection on Positions: 29078
Nr of Matching Refs: 29078 / 29078
Ref/Alt Matching: 29053 / 29078
Flipped Ref/Alt Matching: 0
Together: 29053 / 29078
2504 / 2504 Individuals included in Reference
Extracting up to 2504 Individuals
Extraction of 5008 Haplotypes complete
Flipping Ref/Alt in target for 0 SNPs...
Reduced to markers called 19815 / 29053
Fraction SNPs covered: 0.6820
Successfully saved to: ./Empirical/Eigenstrat/Example/I1178/chr20/
Shuffling phase of target...
Successfully loaded Data from: ./Empirical/Eigenstrat/Example/I1178/chr20/
Loaded Emission Model: haploid
Loaded Transition Model: model
Loaded Post Proc
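
Since the calls in this notebook repeat the same paths, one option is to keep the shared keyword arguments in a dictionary and override only what changes per run. This is plain Python, not a hapROH feature; the helper `run_args` is hypothetical, and the keyword names are the ones used in the `hapsb_ind` calls above.

```python
# Shared settings, copied from the calls in this notebook
base_args = dict(
    path_targets="./Data/ExampleData/Levant_ChL",
    h5_path1000g="./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr",
    meta_path_ref="./Data/1000Genomes/Individuals/meta_df_all.csv",
    folder_out="./Empirical/Eigenstrat/Example/",
    processes=1, logfile=False, combine=False,
)

def run_args(iid, chs, **overrides):
    """Return a fresh keyword dictionary for one run; overrides win over base_args."""
    args = dict(base_args, iid=iid, chs=chs)
    args.update(overrides)
    return args

# Usage:
# hapsb_ind(**run_args("I1178", range(20, 21)))
# hapsb_ind(**run_args("I1178", range(1, 23), processes=6, logfile=True, combine=True))
```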

### Example run of a whole individual (all chromosomes) with output to a logfile
This example runs a whole individual in parallel, with the output sent to a logfile.

Attention: hapROH uses quite a bit of RAM. For an Eigenstrat file, usage can spike up to 6 GB for a long chromosome with most SNPs covered. Allocate memory accordingly
when running multiple processes, or set `processes` to a lower number!

In [8]:
hapsb_ind(iid="I1178", chs=range(1,23), processes=6, 
          path_targets='./Data/ExampleData/Levant_ChL', 
          h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
          meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
          folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
          e_model="haploid", p_model="Eigenstrat",
          random_allele=True, readcounts=False,
          delete=False, logfile=True, combine=True)

Doing Individual I1178...
Running 22 total jobs; 6 in parallel.
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr1/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr2/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr3/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr4/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr6/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr5/hmm_run_log.txt
Combining Information for 22 Chromosomes...
Run finished successfully!
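
The memory caveat above can be turned into a rough rule for choosing `processes`. The sketch below assumes the peak of about 6 GB per process mentioned in this section as a planning figure; the helper `safe_processes` is hypothetical, and your actual memory use depends on coverage and chromosome length.

```python
import multiprocessing as mp

def safe_processes(mem_gb, gb_per_process=6.0, n_cpus=None):
    """Number of parallel processes that fit into mem_gb of RAM.

    gb_per_process defaults to the ~6 GB peak quoted in this notebook
    (a planning assumption, not a guarantee). Always returns at least 1.
    """
    if n_cpus is None:
        n_cpus = mp.cpu_count()
    return max(1, min(n_cpus, int(mem_gb // gb_per_process)))

# Usage: with 32 GB allocated on a 28-core node this gives 5
# n_proc = safe_processes(32, n_cpus=28)
```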


### Run multiple Individuals
Here this is done as a loop over individuals. In practice, I run it as a parallelized sbatch job (setting `processes` to 1 there).

In [43]:
### These are all individuals with 400k SNPs covered
iids = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168', 'I1166', 'I1170', 'I1165', 'I1182']

for iid in iids:
    print(f"Doing Individual: {iid}")
    hapsb_ind(iid=iid, chs=range(1,23), processes=6, 
              path_targets='./Data/ExampleData/Levant_ChL', 
              h5_path1000g='./Data/1000Genomes/HDF5/1240kHDF5/all1240int8/chr', 
              meta_path_ref='./Data/1000Genomes/Individuals/meta_df_all.csv', 
              folder_out='./Empirical/Eigenstrat/Example/', prefix_out='', 
              e_model="haploid", p_model="EigenstratPacked", n_ref=2504,
              random_allele=True, readcounts=False,
              delete=False, logfile=True, combine=True)

Doing Individual: I1178
Doing Individual I1178...
Running 22 total jobs; 6 in parallel.
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr1/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr2/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr3/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr5/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr4/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I1178/chr6/hmm_run_log.txt
Combining Information for 22 Chromosomes...
Run finished successfully!
Doing Individual: I0644
Doing Individual I0644...
Running 22 total jobs; 6 in parallel.
Set Output Log path: ./Empirical/Eigenstrat/Example/I0644/chr1/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I0644/chr2/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstrat/Example/I0644/chr3/hmm_run_log.txt
Set Output Log path: ./Empirical/Eigenstra
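
For the sbatch variant mentioned above, one common pattern is a Slurm job array in which each task handles one individual. The sketch below assumes the array was submitted with zero-based indices (e.g. `--array=0-8` for nine individuals) and reads the standard `SLURM_ARRAY_TASK_ID` environment variable; the helper `iid_for_task` is hypothetical and the author's actual batch script may differ.

```python
import os

iids = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168',
        'I1166', 'I1170', 'I1165', 'I1182']

def iid_for_task(iids, env=os.environ):
    """Pick the individual for this array task (zero-based SLURM_ARRAY_TASK_ID)."""
    task = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task < len(iids):
        raise IndexError(f"Task ID {task} out of range for {len(iids)} individuals")
    return iids[task]

# Usage inside the script that each array task runs:
# iid = iid_for_task(iids)
# hapsb_ind(iid=iid, chs=range(1, 23), processes=1, ...)  # remaining arguments as above
```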

### Postprocess Results into one results.csv (copying in Meta Data)
Take the individual output .csvs and combine them into one big results .csv.
Merging of ROH across gaps and the summaries for ROH > x cM happen here.

In [44]:
from hapsburg.PackagesSupport.pp_individual_roh_csvs import pp_individual_roh
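
To make the idea of gap merging concrete, here is a simplified illustration: neighbouring ROH blocks on one chromosome are joined when the gap between them is at most `gap` cM. This is not hapROH's implementation. It ignores any additional length conditions (such as the `min_len1` and `min_len2` arguments passed further below, whose exact semantics are not described here), and the function name and the cM units of the block coordinates are assumptions.

```python
def merge_close_blocks(blocks, gap=0.5):
    """Merge (start_cm, end_cm) blocks whose separation is at most gap cM.

    Simplified illustration only: blocks are assumed to lie on the same
    chromosome and to be given in centimorgans.
    """
    merged = []
    for start, end in sorted(blocks):
        if merged and start - merged[-1][1] <= gap:
            # Close enough to the previous block: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Usage: the first two blocks are 0.3 cM apart and get merged
# merge_close_blocks([(10.0, 14.0), (14.3, 20.0), (30.0, 35.0)])
```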

### Create Example Meta File
This cell creates the metadata file (some plotting functions use it),
which is then merged into the results when combining individuals.

In [52]:
iids = ['I1178', 'I0644', 'I1160', 'I1152', 'I1168', 'I1166', 'I1170', 'I1165', 'I1182']
df = pd.DataFrame({"iid":iids})

df["age"] = 5950
df["clst"] = "Israel_C"
df["lat"] = 32.974167
df["lon"] = 35.331389

df.to_csv("./Data/ExampleData/meta_blank.csv", 
          sep=",", index=False)
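
Before combining, you may want to confirm that every individual you ran has a row in the meta file. The following sketch uses only the standard library `csv` module and the `iid` column created above; the helper `missing_from_meta` is hypothetical and not part of hapROH.

```python
import csv

def missing_from_meta(meta_path, iids, id_col="iid"):
    """Return the individuals in iids that have no row in the meta .csv."""
    with open(meta_path, newline="") as f:
        known = {row[id_col] for row in csv.DictReader(f)}
    return [iid for iid in iids if iid not in known]

# Usage with the file written above (an empty list means all are present):
# print(missing_from_meta("./Data/ExampleData/meta_blank.csv", iids))
```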

### Combine into output File
Combines the meta file and the individual ROH output files
into a summary table, containing ROH information for each individual.

In [56]:
%%time
### Postprocess the individuals from above and combine into one results .csv
iids = ['I1178', 'I0644', 'I1160', 'I1152', 
        'I1168', 'I1166', 'I1170', 
        'I1165', 'I1182']

df1 = pp_individual_roh(iids, meta_path="./Data/ExampleData/meta_blank.csv", 
                        base_folder="./Empirical/Eigenstrat/Example/",
                        save_path="./Empirical/Eigenstrat/Example/combined_roh05.csv", 
                        output=False, min_cm=[4, 8, 12, 20], snp_cm=50, 
                        gap=0.5, min_len1=2.0, min_len2=4.0)

Loaded 9 / 9 Individuals from Meta
Saved to: ./Empirical/Eigenstrat/Example/combined_roh05.csv
CPU times: user 4.4 s, sys: 0 ns, total: 4.4 s
Wall time: 4.39 s
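
The `min_cm=[4, 8, 12, 20]` argument above sets the length cutoffs for the per-individual summaries. As an illustration of what such a summary looks like, the sketch below computes the total length and number of ROH longer than each cutoff from a list of block lengths in cM. The function and the dictionary keys are hypothetical; the column names in the actual output .csv may differ, so inspect the saved file to confirm them.

```python
def summarize_roh(lengths_cm, min_cm=(4, 8, 12, 20)):
    """Sum and count of ROH blocks longer than each cutoff (lengths in cM)."""
    summary = {}
    for cutoff in min_cm:
        kept = [length for length in lengths_cm if length > cutoff]
        summary[f"sum_roh>{cutoff}"] = sum(kept)  # illustrative key name
        summary[f"n_roh>{cutoff}"] = len(kept)    # illustrative key name
    return summary

# Usage with three made-up block lengths:
# summarize_roh([5.0, 9.5, 25.0])
```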
