NGS datasets are large. Fortunately, the bioinformatics community has developed tools that allow you to download only part of a file. The SAMtools/htslib package (http://www.htslib.org/) includes
tabix and bgzip, which handle indexed access and compression. On the command line, run the following:
</br>
command:
</br>
`apt-get install tabix`
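
Before running the cells that follow, it can help to confirm that both tools are actually on the PATH, since a missing executable makes every later tabix cell fail. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical and not part of htslib.

```python
import shutil

def missing_tools(names):
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

# tabix and bgzip are the two htslib tools used in this recipe
missing = missing_tools(["tabix", "bgzip"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("tabix and bgzip are available")
```

If either tool is reported as missing, install htslib (or fix the PATH) before continuing.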

The tabix command below downloads only part of the chromosome 22 VCF file (positions up to 17 Mbp) from the 1,000 Genomes Project, and bgzip then compresses the result.

In [None]:
!wget https://github.com/samtools/htslib/releases/download/1.19/htslib-1.19.tar.bz2
!tar -xf htslib-1.19.tar.bz2
%cd htslib-1.19
!autoreconf -i
!./configure --enable-libcurl
!make
!make install

In [2]:
!tabix -fh https://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000|bgzip -c > genotypes.vcf.gz

'tabix' is not recognized as an internal or external command,
operable program or batch file.


Next, we create an index, which we need for direct access to a section of the genome.

In [3]:
!tabix -p vcf genotypes.vcf.gz

'tabix' is not recognized as an internal or external command,
operable program or batch file.
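
To see why an index matters, consider a much-simplified analogy: when records are sorted by position, a lookup structure lets us jump straight to a region instead of scanning the whole file. The sketch below uses binary search over a made-up list of positions. It is only an illustration of the idea, not how tabix is implemented, and the function name and positions are hypothetical.

```python
from bisect import bisect_left, bisect_right

def positions_in_region(sorted_positions, start, end):
    """Return positions p with start <= p <= end using binary search."""
    lo = bisect_left(sorted_positions, start)
    hi = bisect_right(sorted_positions, end)
    return sorted_positions[lo:hi]

# Hypothetical, sorted variant positions on one chromosome
positions = [16050075, 16050115, 16050213, 16050319, 16050527, 16050568]
print(positions_in_region(positions, 16050100, 16050400))
```

The lookup touches only a handful of entries regardless of where the region sits, which is the property that makes region queries on large indexed files practical.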


**Let’s start by inspecting the information that we can get per record:**

Remember that each record encodes a variant (such as a SNP, CNV, or INDEL) and the state of that variant in each sample, so we first list the annotations that are available for each record.
</br>
At the variant (record) level, we find AC (the total number of ALT alleles in called genotypes), AF (the estimated allele frequency), NS (the number of samples with data), AN (the total number of alleles in called genotypes), and DP (the total read depth).
</br>
There are others, but they are mostly specific to the 1,000 Genomes Project (here, we will try to be as general as possible). Your own dataset may have more annotations (or none of these).
At the sample level, there are only two annotations in this file: GT (the genotype) and DP (the per-sample read depth). Note that DP appears at both levels: the per-variant DP is the total read depth across samples, while the per-sample DP is the depth for one sample; be sure not to confuse the two.
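
To make the two levels concrete, here is a minimal sketch that splits one VCF data line by hand using only the standard library. The line itself is hypothetical (its values are made up for illustration), and `split_annotations` is a helper name invented here; a real analysis should rely on a proper VCF parser.

```python
# Hypothetical tab-separated VCF data line with one sample (values made up)
line = "22\t16050075\t.\tA\tG\t100\tPASS\tAC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012\tGT:DP\t0|0:1"

def split_annotations(vcf_line):
    """Split a VCF data line into variant-level and sample-level annotations."""
    fields = vcf_line.rstrip("\n").split("\t")
    info_field, format_field, sample_fields = fields[7], fields[8], fields[9:]
    # Variant level: semicolon-separated KEY=VALUE pairs (flags have no value)
    variant_level = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        variant_level[key] = value if value else True
    # Sample level: FORMAT keys paired with each sample's colon-separated values
    keys = format_field.split(":")
    sample_level = [dict(zip(keys, s.split(":"))) for s in sample_fields]
    return variant_level, sample_level

variant_level, sample_level = split_annotations(line)
print("Variant-level DP:", variant_level["DP"])
print("Sample-level DP:", sample_level[0]["DP"])
```

The same key, DP, shows up in both dictionaries with different meanings, which is exactly the confusion the text warns about.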

In [14]:
pip install cyvcf2

Collecting cyvcf2Note: you may need to restart the kernel to use updated packages.
  Using cached cyvcf2-0.30.25.tar.gz (1.4 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'


  error: subprocess-exited-with-error
  
  × Building wheel for cyvcf2 (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [75 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-39
      creating build\lib.win-amd64-cpython-39\cyvcf2
      copying cyvcf2\cli.py -> build\lib.win-amd64-cpython-39\cyvcf2
      copying cyvcf2\__init__.py -> build\lib.win-amd64-cpython-39\cyvcf2
      copying cyvcf2\__main__.py -> build\lib.win-amd64-cpython-39\cyvcf2
      creating build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\test_cli.py -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\test_hemi.py -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\test_reader.py -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\test_writer.py -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests


  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting coloredlogs (from cyvcf2)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
     -------------------------------------- 46.0/46.0 kB 163.4 kB/s eta 0:00:00
Collecting humanfriendly>=9.1 (from coloredlogs->cyvcf2)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
     -------------------------------------- 86.8/86.8 kB 490.7 kB/s eta 0:00:00
Collecting pyreadline3 (from humanfriendly>=9.1->coloredlogs->cyvcf2)
  Downloading pyreadline3-3.4.1-py3-none-any.whl (95 kB)
     -------------------------------------- 95.2/95.2 kB 913.6 kB/s eta 0:00:00
Building wheels for collected packages: cyvcf2
  Building wheel for cyvcf2 (pyproject.toml): started
  Building wheel for cyvcf2 (pyproject.toml): finished with status 'error'
Failed to 

      copying cyvcf2\relatedness.h -> build\lib.win-amd64-cpython-39\cyvcf2
      copying cyvcf2\tests\bug.vcf.gz -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\decomposed.vcf -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\empty.vcf -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\issue_198.vcf -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\issue_44.vcf -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\no-seq-len.vcf -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\no-seq-names.vcf -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\o.vcf.gz -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\seg.vcf.gz -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\test-alt-repr.vcf -> build\lib.win-amd64-cpython-39\cyvcf2\tests
      copying cyvcf2\tests\test-diff.csi -> build\l

In [7]:
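# cyvcf2 must be installed for this import to work
from cyvcf2 import VCF

v = VCF('genotypes.vcf.gz')
rec = next(v)  # first record in the file

print('Variant Level information')
# Iterate over the variant-level (INFO) annotations
for info in rec.INFO:
  print(info)

print('Sample Level information')
# FORMAT lists the per-sample annotation keys
for fmt in rec.FORMAT:
  print(fmt)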

ModuleNotFoundError: No module named 'cyvcf2'

**Now that we know what information is available, let’s inspect a single VCF record:**

We start by retrieving the standard information: the chromosome, position, ID, reference base
(typically just one), alternative bases (there can be more than one, although a common
first filtering step is to accept only a single ALT, for example, only biallelic
SNPs), quality (Phred-scaled, as you might expect), and filter status. Regarding the filter status,
remember that whatever the VCF file says, you may still want to apply extra filters (as in the
next recipe, Studying genome accessibility and filtering SNP data).
We then print the additional variant-level information (AC, NS, AF, AN, DP, and so on),
followed by the sample format (in this case, DP and GT). We also count the number of
samples and inspect a single sample to see how it was called for this variant: its
reported alleles (which also tell us whether it is heterozygous), its phasing status (this
dataset happens to be phased, which is not that common), and its per-sample read depth.

In [19]:
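v = VCF('genotypes.vcf.gz')
samples = v.samples
print(len(samples))  # number of samples in the file
variant = next(v)  # first record
# Standard fields
print(variant.CHROM, variant.POS, variant.ID, variant.REF,
      variant.ALT, variant.QUAL, variant.FILTER)
print(variant.INFO)  # INFO object; iterate over it to see the annotations
print(variant.FORMAT)
print(variant.is_snp)
# First sample: allele bases, allele indices, and phasing flag
str_alleles = variant.gt_bases[0]
alleles = variant.genotypes[0][0:2]
is_phased = variant.genotypes[0][2]
print(str_alleles, alleles, is_phased)
print(variant.format('DP')[0])  # per-sample read depth for the first sample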

2504
22 16050075 None A ['G'] 100.0 None
<cyvcf2.cyvcf2.INFO object at 0x7f8f0bb43ed0>
['GT', 'DP']
True
A|A [0, 0] True
1
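
The QUAL value of 100.0 printed above is Phred-scaled, meaning it encodes an error probability on a logarithmic scale (QUAL = -10 log10 of the probability that the call is wrong). As a small sketch, the conversion back to a probability can be written as follows; the function name is hypothetical.

```python
def phred_to_error_probability(qual):
    """Convert a Phred-scaled quality to the probability that the call is wrong."""
    return 10 ** (-qual / 10)

# A few example qualities, including the 100 seen in the record above
for qual in (10, 30, 100):
    print(qual, phred_to_error_probability(qual))
```

Each additional 10 points of quality corresponds to a tenfold drop in the estimated error probability.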


**Let’s check the type of variant and the number of nonbiallelic SNPs in a single pass:**

We use Python's defaultdict to count variants by type and subtype. We find that this dataset has
INDELs, CNVs, and, of course, SNPs (roughly two-thirds transitions and one-third
transversions). There is a residual number (79) of triallelic SNPs.

In [20]:
from collections import defaultdict
f = VCF('genotypes.vcf.gz')
my_type = defaultdict(int)   # (type, subtype) -> number of variants
num_alts = defaultdict(int)  # number of ALT alleles -> number of SNPs
for variant in f:
  my_type[variant.var_type, variant.var_subtype] += 1
  if variant.var_type == 'snp':
    num_alts[len(variant.ALT)] += 1
print(my_type)

defaultdict(<class 'int'>, {('snp', 'ts'): 10054, ('snp', 'tv'): 5917, ('sv', 'CNV'): 2, ('indel', 'del'): 273, ('snp', 'unknown'): 79, ('indel', 'ins'): 127, ('indel', 'unknown'): 13, ('sv', 'DEL'): 6, ('sv', 'SVA'): 1})
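
The 'ts' and 'tv' subtypes in the output above are transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse). The following is a minimal sketch of that classification, counted with a defaultdict in the same way as the cell above. The helper name and the list of (REF, ALT) pairs are hypothetical, and the function assumes single-base alleles with REF different from ALT.

```python
from collections import defaultdict

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def snp_subtype(ref, alt):
    """Classify a single-base substitution as 'ts' or 'tv' (assumes ref != alt)."""
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "ts" if same_class else "tv"

# Hypothetical (REF, ALT) pairs, made up for illustration
pairs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("T", "C")]
counts = defaultdict(int)  # missing keys start at 0, so no existence check is needed
for ref, alt in pairs:
    counts[snp_subtype(ref, alt)] += 1
print(dict(counts))

# Transition/transversion ratio from the counts printed above
print(round(10054 / 5917, 2))
```

The ratio computed from the reported counts is consistent with the rough two-thirds to one-third split described in the text.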
