# Quick Start for ancIBD

This notebook is a quick start guide for running ancIBD. It uses command-line wrappers around the functions introduced in the sections [preparing input](create_hdf5_from_vcf.ipynb) and [calling IBD with ancIBD](run_ancIBD.ipynb). Writing your own wrapper script for these functions provides more flexibility, whereas the command-line interface introduced in this guide is easier to use. We provide two command-line interfaces (`ancIBD-run` and `ancIBD-summary`) for running ancIBD quickly on your imputed data.

The test data used in these tutorials can be downloaded from https://www.dropbox.com/sh/q18yyrffbdj1yv1/AAC1apifYB_oKB8SNrmQQ-26a?dl=0. It contains imputed VCF files for a subset of samples from early Neolithic Britain that belong to an extended pedigree ([Fowler et al.](https://www.nature.com/articles/s41586-021-04241-4)).

### Calling IBD

In addition to the imputed VCF files, you need three more files, all of which are provided at the Dropbox link given above.

* marker_path: Path of the 1240k SNPs to use (you can find those in `./filters/snps_bcftools_ch*.csv` from the download link)
* map_path: Path of the map file to use (eigenstrat .snp file, you can find it in `./afs/v51.1_1240k.snp` from the download link)
* af_path (optional): Path of the allele frequencies to merge into the hdf5 file (you can find them in `./afs/v51.1_1240k_AF_ch*.tsv` from the download link). If not provided, allele frequencies calculated from the samples themselves are used.
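
Before launching a run, it can help to confirm that the three files are where you expect them. The following is a minimal sketch, not part of ancIBD: `missing_inputs` is a hypothetical helper, and the paths assume the downloaded folder is stored as `./data` next to this notebook, as in the cells of this tutorial.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the labels of input files that do not exist on disk."""
    return [label for label, path in paths.items() if not Path(path).is_file()]


# Assumed layout: the downloaded folder is saved as ./data next to the notebook.
ch = 20
inputs = {
    "marker_path": f"./data/filters/snps_bcftools_ch{ch}.csv",
    "map_path": "./data/afs/v51.1_1240k.snp",
    "af_path": f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",  # optional input
}

missing = missing_inputs(inputs)
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```

If `af_path` is reported as missing, you can still run without it, since allele frequencies are then calculated from the samples themselves.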

We now run ancIBD on chromosome 20 as an example. Before running the following command, adjust the paths to the three files above to match your own environment if needed. The file paths in this tutorial assume that the folder downloaded from the Dropbox link is located in the same directory as this Jupyter notebook.

In [2]:
# Modify file paths according to your own environment if needed
ch = 20
marker_path = f'./data/filters/snps_bcftools_ch{ch}.csv'
map_path = './data/afs/v51.1_1240k.snp'
af_path = f'./data/afs/v51.1_1240k_AF_ch{ch}.tsv'
vcf_path = f'./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz'

In [4]:
!ancIBD-run --vcf $vcf_path --ch $ch --out test --marker_path $marker_path --map_path $map_path --af_path $af_path --prefix example_hazelton

Print downsampling to 1240K...
Running bash command: 
bcftools view -Ov -o test/example_hazelton.ch20.1240k.vcf -T ./data/filters/snps_bcftools_ch20.csv -M2 -v snps ./data/vcf.raw/example_hazelton_chr20.vcf.gz
Finished BCF tools filtering to target markers.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 28940 variants.
Loaded 6 individuals.
Loaded 30377 Chr.20 1240K SNPs.
Intersection 28827 out of 28940 HDF5 SNPs
Interpolating 113 variants.
Finished Chromosome 20.
Adding map to HDF5...
Intersection 28827 out of 28940 target HDF5 SNPs. 113 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: test/example_hazelton.ch20.h5



If you already have an appropriate hdf5 file for your samples, you can also pass it to the command line directly. Please make sure that the hdf5 file name ends with the suffix "ch{chromosome number}.h5" (e.g., "test.ch20.h5").
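
The naming requirement can be checked before calling `ancIBD-run --h5`. This is a small sketch under the assumption that the rule is exactly "file name ends with `ch{chromosome number}.h5`", as stated above. `h5_name_matches` is a hypothetical helper and does not reflect how ancIBD itself validates the name.

```python
from pathlib import Path


def h5_name_matches(h5_path, ch):
    """Return True if the file name ends with 'ch{ch}.h5', e.g. 'test.ch20.h5' for ch=20."""
    return Path(h5_path).name.endswith(f"ch{ch}.h5")


# The hdf5 file written in the previous step follows the expected pattern.
print(h5_name_matches("./test/example_hazelton.ch20.h5", 20))
```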

In [6]:
!ancIBD-run --h5 ./test/example_hazelton.ch20.h5 --ch $ch --out test --marker_path $marker_path --map_path $map_path --af_path $af_path --prefix example_hazelton

Now we can do the same for all 22 autosomes. This takes about 6 minutes.

In [4]:
%%bash

map_path='./data/afs/v51.1_1240k.snp'

for ch in {1..22};
do
    marker_path=data/filters/snps_bcftools_ch$ch.csv
    af_path=data/afs/v51.1_1240k_AF_ch$ch.tsv
    vcf_path=data/vcf.raw/example_hazelton_chr$ch.vcf.gz
    ancIBD-run --vcf $vcf_path \
        --ch $ch --out test --marker_path $marker_path --map_path $map_path --af_path $af_path --prefix example_hazelton
done

Print downsampling to 1240K...
Running bash command: 
bcftools view -Ov -o test/example_hazelton.ch1.1240k.vcf -T data/filters/snps_bcftools_ch1.csv -M2 -v snps data/vcf.raw/example_hazelton_chr1.vcf.gz
Finished BCF tools filtering to target markers.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 88408 variants.
Loaded 6 individuals.
Loaded 93166 Chr.1 1240K SNPs.
Intersection 88115 out of 88408 HDF5 SNPs
Interpolating 293 variants.
Finished Chromosome 1.
Adding map to HDF5...
Intersection 88115 out of 88408 target HDF5 SNPs. 293 SNPs set to AF=0.5
Transformation complete! Find new hdf5 file at: test/example_hazelton.ch1.h5

Print downsampling to 1240K...
Running bash command: 
bcftools view -Ov -o test/example_hazelton.ch2.1240k.vcf -T data/filters/snps_bcftools_ch2.csv -M2 -v snps data/vcf.raw/example_hazelton_chr2.vcf.gz
Finished BCF tools filtering to target markers.
Converting to HDF5...
Finished conversion t

<div class="alert alert-info"> 

Note

For large sample sizes, we recommend parallelizing over autosomes for speed-up (e.g., by submitting array jobs on a cluster). The above for-loop is efficient only for small sample sizes.

</div>
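
If no cluster is available, the same idea can be approximated on a single machine. The sketch below is not an array job: it starts one `ancIBD-run` process per chromosome from a small thread pool. It uses only the flags shown in the for-loop above, and it assumes that the `data` folder layout matches this tutorial and that concurrent runs writing per-chromosome files into the same output folder do not interfere with each other. `build_command` and `run_all` are hypothetical helper names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(ch, data_dir="data", out="test", prefix="example_hazelton"):
    """Assemble the ancIBD-run call for one chromosome, mirroring the for-loop above."""
    return [
        "ancIBD-run",
        "--vcf", f"{data_dir}/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
        "--ch", str(ch),
        "--out", out,
        "--marker_path", f"{data_dir}/filters/snps_bcftools_ch{ch}.csv",
        "--map_path", f"{data_dir}/afs/v51.1_1240k.snp",
        "--af_path", f"{data_dir}/afs/v51.1_1240k_AF_ch{ch}.tsv",
        "--prefix", prefix,
    ]


def run_all(chromosomes=range(1, 23), max_workers=4):
    """Run one ancIBD-run process per chromosome, max_workers at a time.

    Returns a dict mapping chromosome number to the process return code.
    """
    def run_one(ch):
        return ch, subprocess.run(build_command(ch)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, chromosomes))


# Inspect the command for one chromosome without running anything.
print(" ".join(build_command(20)))
```

Calling `run_all()` launches the actual runs. Keep `max_workers` small enough that the combined memory use of the parallel processes fits on your machine.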

### Combine IBD over 22 autosomes and generate summary statistics

Now that we have individual IBD files for each autosome, we can combine the information across chromosomes and obtain genome-wide summary statistics for all pairs of samples. Only pairs of samples that share at least one IBD segment passing the length cutoff are recorded.

In [5]:
!ancIBD-summary --tsv test/example_hazelton.ch --out test/

Chromosome 1; Loaded 10 IBD
Chromosome 2; Loaded 9 IBD
Chromosome 3; Loaded 6 IBD
Chromosome 4; Loaded 9 IBD
Chromosome 5; Loaded 8 IBD
Chromosome 6; Loaded 7 IBD
Chromosome 7; Loaded 9 IBD
Chromosome 8; Loaded 7 IBD
Chromosome 9; Loaded 6 IBD
Chromosome 10; Loaded 7 IBD
Chromosome 11; Loaded 5 IBD
Chromosome 12; Loaded 5 IBD
Chromosome 13; Loaded 8 IBD
Chromosome 14; Loaded 6 IBD
Chromosome 15; Loaded 3 IBD
Chromosome 16; Loaded 6 IBD
Chromosome 17; Loaded 4 IBD
Chromosome 18; Loaded 5 IBD
Chromosome 19; Loaded 8 IBD
Chromosome 20; Loaded 6 IBD
Chromosome 21; Loaded 6 IBD
Chromosome 22; Loaded 6 IBD
Saved 146 IBD to test/ch_all.tsv.
> 8.0 cM: 146/146
Of these with suff. SNPs per cM> 220:               113/146
4     9
2     8
1     7
13    7
6     7
8     7
10    7
21    6
5     6
7     6
16    6
11    5
9     5
12    4
18    4
20    4
3     4
14    3
17    3
22    3
15    2
Name: ch, dtype: int64
Saved 9 individual IBD pairs to: test/ibd_ind.tsv
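
The combined table can be inspected further in Python. The sketch below counts rows per chromosome with the standard library only. It assumes that `test/ch_all.tsv` is tab-separated with a header row and contains a column named `ch`, which the `Name: ch` line in the output above suggests but does not guarantee. `segments_per_chromosome` is a hypothetical helper.

```python
import csv
from collections import Counter


def segments_per_chromosome(tsv_path, column="ch"):
    """Count rows per chromosome in a tab-separated IBD table.

    Assumes a header row and an integer-valued chromosome column.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(int(row[column]) for row in reader)
```

For example, `segments_per_chromosome("test/ch_all.tsv").most_common()` lists chromosomes from most to fewest rows. These counts cover every row in the file and are not filtered by SNP density.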


To view all options provided by the two command-line interfaces, use `-h`. For power users or people interested in applying the method beyond the 1240k SNP set, keep in mind that one can obtain maximum flexibility by writing one's own wrappers (see the sections [prepare input](create_hdf5_from_vcf.ipynb), [run ancIBD](run_ancIBD.ipynb), and [visualization](plot_IBD.ipynb)).

In [10]:
!ancIBD-run -h

usage: ancIBD-run [-h] [--vcf VCF] [--h5 H5] --ch CH --marker_path MARKER_PATH
                  --map_path MAP_PATH [--af_path AF_PATH] [--out OUT]
                  [--prefix PREFIX] [--min MIN] [--iid IID] [--pair PAIR]

Run ancIBD.

optional arguments:
  -h, --help            show this help message and exit
  --vcf VCF             path to the imputed vcf file
  --h5 H5               path to hdf5 file. If specified, ancIBD will skip the
                        vcf to hdf5 conversion step. Only one of --vcf and
                        --h5 should be specified.
  --ch CH               chromosome number (1-22).
  --marker_path MARKER_PATH
                        path to the marker file
  --map_path MAP_PATH   path to the map file
  --af_path AF_PATH     path to the allele frequency file (optional)
  --out OUT             output folder to store IBD results and the
                        intermediary .hdf5 file. If not specified, the results
                        will be stored in the

In [11]:
!ancIBD-summary -h

usage: ancIBD-summary [-h] --tsv TSV [--ch CH] [--bin BIN] [--snp_cm SNP_CM]
                      [--out OUT]

Run ancIBD.

optional arguments:
  -h, --help       show this help message and exit
  --tsv TSV        base path to the individual IBD files.
  --ch CH          chromosome number, expressed in the format chrom-chrom,
                   e.g, 1-22). The default is 1-22.
  --bin BIN        length bin over which IBD sharing summary statistics for
                   pairs of samples will be calculated. Default is 8,12,16,20.
  --snp_cm SNP_CM  minimum number of SNPs per centimorgan for a segment to be
                   considered. The default is 220 to reduce false positive
                   rates.
  --out OUT        output folder to store results. If not specified, the
                   results will be stored in the current directory.
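
The `--ch` and `--bin` values in the help text above are plain strings. As an illustration of the two formats only, and not of ancIBD's own argument parsing, the following sketch turns `1-22` into a list of chromosome numbers and `8,12,16,20` into numeric length thresholds. Both function names are hypothetical, and reading the bins as centimorgan values is an assumption based on the `> 8.0 cM` line in the summary output.

```python
def parse_chromosome_range(text):
    """Turn 'first-last' (e.g. '1-22') into an inclusive list of chromosome numbers."""
    first, last = (int(part) for part in text.split("-"))
    return list(range(first, last + 1))


def parse_bins(text):
    """Turn a comma-separated string (e.g. '8,12,16,20') into a list of length thresholds."""
    return [float(part) for part in text.split(",")]


chromosomes = parse_chromosome_range("1-22")
bins = parse_bins("8,12,16,20")
print(len(chromosomes), "chromosomes; length bins:", bins)
```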
