# Vignette Notebook: How to call IBD with ancIBD
This notebook walks you through the steps to call IBD segments with `ancIBD`.
It assumes you have data in HDF5 format, including a genetic map and ideally also allele frequency data. To produce such an .hdf5 file from an imputed VCF, please see the vignette notebook `create_hdf5_from_vcf.ipynb`.

In [1]:
import sys as sys
import matplotlib.cm as cm
import pandas as pd
import os as os

### The following code gives nice and clean Arial font on your plots
from matplotlib import rcParams
rcParams['font.family'] = 'sans-serif'   # Set the default font family
rcParams['font.sans-serif'] = ['Arial']

### Set working directory to your vignette folder
###
# Edit the following path to point to your folder
path = "/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette/"

os.chdir(path)  # Set the right Path (in line with Atom default)
print(os.getcwd())
###

### The following code puts your local copy of the ancIBD code first in the import path
### Only use this line if you do not want to use the pip-installed package
sys.path.insert(0,"/n/groups/reich/hringbauer/git/hapBLOCK/package/")  # hack to get development package first in path

/n/groups/reich/hringbauer/git/hapBLOCK/notebook/vignette


# Code to run IBD calling
Note: For a quick test run on a single pair of individuals and a specific chromosome, including a visualization of the posterior, see the vignette notebook `./plot_IBD.ipynb`. This visual test is always recommended to verify that your data is sound and everything works as expected.

Here, we walk through calling IBD in a full data set with multiple individuals and chromosomes.

In [2]:
from ancIBD.run import hapBLOCK_chroms

### Specify list of individuals to screen for IBD
This is the list of individuals that ancIBD will load into memory.

**Please remember**: ancIBD is data-hungry and only works for samples **with >600,000 1240k SNPs covered**. If you run it for individuals with fewer SNPs, you will still receive output, **but that output is not trustworthy**: it has little power to call IBD and a high false positive rate.

In [3]:
iids = ["I12439", "I12440", "I12438",
        "I12896", "I21390", "I30300"] # The six Hazelton individuals of the example data.

### Screen the sample for IBD
The function `hapBLOCK_chroms` calls IBD on an input hdf5 file. It returns a dataframe of IBD segments, which, in the call below, is also saved into the specified output folder.

Important parameters varying from application to application are:
- folder_in: The path of the hdf5 files used for IBD calling, given as everything up to the chromosome number.
- iids: All iids to load from the hdf5. They have to match the IIDs in the `sample` field of the HDF5
- run_iids: Which pairs to run [[iid1, iid2], ...] If this parameter is left empty, then all pairs are run.
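
Because `folder_in` is a prefix rather than a full file name, a typo in it only surfaces once the caller fails to find a file. The following is a minimal sketch of a pre-flight check. The helper `missing_chromosome_files` is hypothetical (not part of `ancIBD`), and the `.h5` suffix is an assumption about how the per-chromosome files are named; adjust it to match your own files.

```python
from pathlib import Path

def missing_chromosome_files(folder_in, chs=range(1, 23), suffix=".h5"):
    """Return the expected per-chromosome paths that do not exist on disk.

    ASSUMPTION: files are named <folder_in><chromosome><suffix>. The ".h5"
    suffix is a guess and may need to be changed for your data.
    """
    paths = [Path(f"{folder_in}{ch}{suffix}") for ch in chs]
    return [p for p in paths if not p.is_file()]

# Same prefix as used for folder_in in this vignette
missing = missing_chromosome_files("./data/hdf5/example_hazelton_chr")
print(f"{len(missing)} of 22 expected files are missing")
for p in missing:
    print("  ", p)
```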

The other parameters specify the various modes of `ancIBD` and are set to their default values. E.g. `ibd_in`, `ibd_out` and `ibd_jump` control the jump rates of the underlying HMM (in rates per Morgan), and the various `_model` parameters specify the type of input data. Power users can modify those settings, but the default values are the recommended parameters for typical human data.

In [4]:
%%time

for ch in range(1,23):
    df_ibd = hapBLOCK_chroms(folder_in='./data/hdf5/example_hazelton_chr',
                             iids=iids, run_iids=[],
                             ch=ch, folder_out='./output/ibd_hazelton/',
                             output=False, prefix_out='', logfile=False,
                             l_model='hdf5', e_model='haploid_gl', h_model='FiveStateScaled', t_model='standard',
                             ibd_in=1, ibd_out=10, ibd_jump=400,
                             min_cm=6, cutoff_post=0.99, max_gap=0.0075,
                             processes=1)

CPU times: user 8.63 s, sys: 162 ms, total: 8.79 s
Wall time: 9.46 s


Congrats, now you have all the IBD! Notice the speed of the IBD caller.
All pairs of six individuals (15 in total) and all chromosomes took only a few seconds to run.
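
The pair count can be checked with the standard library, and the same code shows one way to build an explicit pair list in the `[[iid1, iid2], ...]` layout described for `run_iids`. This is only a sketch: the focal individual is an arbitrary choice for illustration, and passing such a list to `run_iids` relies on the parameter description above.

```python
from itertools import combinations

iids = ["I12439", "I12440", "I12438",
        "I12896", "I21390", "I30300"]

# Every unordered pair of individuals, as a list of [iid1, iid2] lists
all_pairs = [list(pair) for pair in combinations(iids, 2)]
print(len(all_pairs))  # 6 * 5 / 2 = 15

# Hypothetical subset: only the pairs that involve one focal individual
focal = "I12896"
focal_pairs = [pair for pair in all_pairs if focal in pair]
print(focal_pairs)
```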

### Combine IBD calls from all chromosomes
Now we need to combine the IBD calls from each chromosome into one overall file. This is done in a separate function to allow parallelization of the above function (e.g. as an array submission on a scientific cluster).
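
As a rough sketch of such an array submission, each task can run a single chromosome and pick its chromosome number from the environment. The example assumes a Slurm cluster, where a job array started with `--array=1-22` sets `SLURM_ARRAY_TASK_ID` for each task. The two helper functions are hypothetical wrappers, not part of `ancIBD`; the call to `hapBLOCK_chroms` repeats the arguments of the loop above.

```python
import os

def chromosome_from_env(var="SLURM_ARRAY_TASK_ID"):
    """Read the chromosome number (1-22) for this array task from the environment."""
    value = os.environ.get(var)
    if value is None:
        raise RuntimeError(f"Environment variable {var} is not set")
    ch = int(value)
    if not 1 <= ch <= 22:
        raise ValueError(f"Chromosome must be in 1..22, got {ch}")
    return ch

def run_one_chromosome(ch, iids):
    """Call IBD for a single chromosome with the same settings as the loop above."""
    # Imported here so that chromosome_from_env can be used without ancIBD installed
    from ancIBD.run import hapBLOCK_chroms
    return hapBLOCK_chroms(folder_in='./data/hdf5/example_hazelton_chr',
                           iids=iids, run_iids=[],
                           ch=ch, folder_out='./output/ibd_hazelton/',
                           output=False, prefix_out='', logfile=False,
                           l_model='hdf5', e_model='haploid_gl', h_model='FiveStateScaled', t_model='standard',
                           ibd_in=1, ibd_out=10, ibd_jump=400,
                           min_cm=6, cutoff_post=0.99, max_gap=0.0075,
                           processes=1)

# In the script submitted as an array job, one would then call:
# run_one_chromosome(chromosome_from_env(), iids)
```

Once all 22 tasks have finished, the per-chromosome output can be merged in a single follow-up step.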

In [5]:
from ancIBD.IO.ind_ibd import combine_all_chroms

In [6]:
combine_all_chroms(chs=range(1,23),
                   folder_base='./output/ibd_hazelton/ch',
                   path_save='./output/ibd_hazelton/ch_all.tsv')

Chromosome 1; Loaded 35 IBD
Chromosome 2; Loaded 22 IBD
Chromosome 3; Loaded 23 IBD
Chromosome 4; Loaded 20 IBD
Chromosome 5; Loaded 22 IBD
Chromosome 6; Loaded 15 IBD
Chromosome 7; Loaded 24 IBD
Chromosome 8; Loaded 13 IBD
Chromosome 9; Loaded 18 IBD
Chromosome 10; Loaded 16 IBD
Chromosome 11; Loaded 17 IBD
Chromosome 12; Loaded 21 IBD
Chromosome 13; Loaded 19 IBD
Chromosome 14; Loaded 10 IBD
Chromosome 15; Loaded 17 IBD
Chromosome 16; Loaded 14 IBD
Chromosome 17; Loaded 18 IBD
Chromosome 18; Loaded 16 IBD
Chromosome 19; Loaded 17 IBD
Chromosome 20; Loaded 17 IBD
Chromosome 21; Loaded 14 IBD
Chromosome 22; Loaded 13 IBD
Saved 401 IBD to ./output/ibd_hazelton/ch_all.tsv.


### Postprocess into summary dataframe
For easy screening for IBD between pairs, the function `create_ind_ibd_df` produces a summary table.
Each row is one pair of individuals, and there are columns for summary statistics:
- max_IBD: Length of the longest IBD segment (in cM)
- sum_IBD>x: The total length of all IBD segments longer than x cM
- n_IBD>x: The total number of IBD segments longer than x cM

By default, these are recorded for >8, >12, >16 and >20 cM. This can be changed with the `min_cms` keyword.

The function also filters for trustworthy IBD blocks. Most importantly, only IBD segments with at least a certain SNP density are kept (set with `snp_cm`, in SNPs per cM). The reason is that areas of low SNP density (such as regions with large gaps between SNPs) are very prone to false positives.

**Note**: Only pairs with at least one IBD segment are recorded (in the run above, >6 cM). A pair of individuals missing from the table therefore does not share any IBD segments. The reason for this omission is that in large samples most pairs of individuals have 0 IBD, so recording them would require a lot of memory for little informational gain.
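
To make the summary statistics concrete, here is a small pandas sketch of the same idea: filter segments by SNP density, then aggregate per pair. It is not the implementation used by `create_ind_ibd_df`. The function `summarize_pairs`, the column names `length_cm` and `n_snps`, and the toy values are all made up for illustration.

```python
import pandas as pd

# Toy segment table. Column names are hypothetical and the values are made up;
# this is not ancIBD output.
segments = pd.DataFrame({
    "iid1":      ["A", "A", "A", "B"],
    "iid2":      ["B", "B", "C", "C"],
    "length_cm": [25.0, 9.0, 14.0, 7.0],
    "n_snps":    [7000, 1500, 3500, 2000],
})

def summarize_pairs(df, min_cms=(8, 12, 16, 20), snp_cm=220):
    """Per-pair summary of the IBD segments that pass a SNP-density filter."""
    # Keep only segments with more than snp_cm SNPs per cM
    density = df["n_snps"] / df["length_cm"]
    kept = df[density > snp_cm]

    # Longest remaining segment per pair
    out = kept.groupby(["iid1", "iid2"])["length_cm"].max().to_frame("max_IBD")

    # Summed length and count of segments longer than each cutoff x (in cM)
    for x in min_cms:
        lengths = kept[kept["length_cm"] > x].groupby(["iid1", "iid2"])["length_cm"]
        out[f"sum_IBD>{x}"] = lengths.sum()
        out[f"n_IBD>{x}"] = lengths.count()

    # Pairs without any segment above a cutoff get 0 instead of NaN
    return out.fillna(0).sort_values("max_IBD", ascending=False).reset_index()

print(summarize_pairs(segments))
```

In the toy data, the 9 cM segment of pair A/B has fewer than 220 SNPs per cM and is dropped, so it does not count towards `n_IBD>8`.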

In [7]:
from ancIBD.IO.ind_ibd import create_ind_ibd_df

In [8]:
%%time
df_res = create_ind_ibd_df(ibd_data = './output/ibd_hazelton/ch_all.tsv',
                      min_cms = [8, 12, 16, 20], snp_cm = 220, min_cm = 5, sort_col = 0,
                      savepath = "./output/ibd_hazelton/ibd_ind.d220.tsv")

> 5 cM: 401/401
Of these with suff. SNPs per cM> 220:               293/401
1     23
2     21
3     19
4     19
12    18
5     17
10    16
11    16
13    15
6     15
7     14
16    13
20    13
21    12
8     12
9     11
14     9
18     9
15     8
17     8
22     5
Name: ch, dtype: int64
Saved 15 individual IBD pairs to: ./output/ibd_hazelton/ibd_ind.d220.tsv
CPU times: user 83.8 ms, sys: 5.14 ms, total: 88.9 ms
Wall time: 113 ms


Congrats, that closes the post-processing of IBD. Let's have a look at the output data:

In [9]:
df_ibd = pd.read_csv('./output/ibd_hazelton/ibd_ind.d220.tsv', sep="\t")
df_ibd

Unnamed: 0,iid1,iid2,max_IBD,sum_IBD>8,n_IBD>8,sum_IBD>12,n_IBD>12,sum_IBD>16,n_IBD>16,sum_IBD>20,n_IBD>20
0,I12438,I30300,283.652203,3334.432993,20.0,3334.432993,20.0,3334.432993,20.0,3334.432993,20.0
1,I12440,I30300,283.606503,3322.099692,20.0,3322.099692,20.0,3322.099692,20.0,3322.099692,20.0
2,I12440,I12438,176.739703,1687.484123,22.0,1687.484123,22.0,1675.113823,21.0,1675.113823,21.0
3,I12439,I12440,153.3077,1627.592178,20.0,1605.567779,18.0,1605.567779,18.0,1605.567779,18.0
4,I12439,I30300,98.252594,915.175968,13.0,915.175968,13.0,915.175968,13.0,915.175968,13.0
5,I12439,I12438,93.373895,497.595579,10.0,489.55828,9.0,489.55828,9.0,489.55828,9.0
6,I12440,I12896,26.5884,220.856794,19.0,101.284505,6.0,60.889204,3.0,26.5884,1.0
7,I12438,I12896,13.789403,153.881811,16.0,13.789403,1.0,0.0,0.0,0.0,0.0
8,I12439,I12896,16.200301,142.90681,14.0,43.321703,3.0,16.200301,1.0,0.0,0.0
9,I12896,I21390,17.000001,128.574783,12.0,57.787899,4.0,17.000001,1.0,0.0,0.0


There are several pairs of individuals with ample IBD, some of them on all chromosomes!
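
Such pairs can be pulled out of the summary table with ordinary pandas filtering. The sketch below uses three rows copied from the table above so that it runs on its own. The cutoff of 400 cM is a hypothetical value chosen only to show the syntax, not a recommended threshold. Because the column names contain `>`, bracket indexing is the simplest way to access them.

```python
import pandas as pd

# Three rows copied from the summary table above (subset of columns)
df_pairs = pd.DataFrame({
    "iid1": ["I12438", "I12439", "I12896"],
    "iid2": ["I30300", "I12438", "I21390"],
    "max_IBD": [283.652203, 93.373895, 17.000001],
    "sum_IBD>12": [3334.432993, 489.55828, 57.787899],
    "n_IBD>12": [20.0, 9.0, 4.0],
})

# Hypothetical cutoff, chosen only to illustrate the filtering syntax
min_sum_cm = 400
close = df_pairs[df_pairs["sum_IBD>12"] > min_sum_cm]
print(close[["iid1", "iid2", "sum_IBD>12", "n_IBD>12"]])
```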

Congrats - you have now learned how to call IBD segments and to post-process the output. 

The next step is visualizing the output. The vignette notebook `./plot_IBD.ipynb` will walk you through that step.