NGS files are large, so downloading them whole is often impractical. Fortunately, the bioinformatics community has developed tools that allow partial downloads of data. tabix and bgzip, which are part of the SAMtools/htslib package (http://www.htslib.org/), take care of this kind of data management. On the command line, install them with the following command:

`apt-get install tabix`

The tabix command below partially downloads the VCF file for chromosome 22 (up to 17 Mbp) of the 1,000 Genomes Project and pipes it to bgzip, which compresses it.

In [19]:
conda install -c conda-forge -c bioconda tabix


In [3]:
# Optional: build htslib from source with libcurl support (needed to read remote URLs)
# !wget https://github.com/samtools/htslib/releases/download/1.22.1/htslib-1.22.1.tar.bz2
# !tar -xf htslib-1.22.1.tar.bz2
# %cd htslib-1.22.1
# !autoreconf -i
# !./configure --enable-libcurl
# !make
# !make install


In [2]:
!tabix -fh https://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000|bgzip -c > genotypes.vcf.gz

Next, we create an index, which we need for direct access to a section of the genome.

In [3]:
!tabix -p vcf genotypes.vcf.gz

**Let’s start by inspecting the information that we can get per record:**

We start by inspecting the annotations that are available for each record (remember that each record encodes a variant, such as an SNP, CNV, or INDEL, and the state of that variant per sample).

At the variant (record) level, we find AC (the total number of ALT alleles in called genotypes), AF (the estimated allele frequency), NS (the number of samples with data), AN (the total number of alleles in called genotypes), and DP (the total read depth). There are others, but they are mostly specific to the 1,000 Genomes Project (here, we will try to be as general as possible). Your own dataset may have more annotations (or none of these).

At the sample level, there are only two annotations in this file: GT (the genotype) and DP (the per-sample read depth). DP therefore appears twice, once per variant (total read depth) and once per sample; be sure not to confuse the two.
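
To make the distinction between variant-level and sample-level annotations concrete, here is a minimal sketch that splits one VCF data line by hand. The line is made up for illustration (its values and the helper name `split_record` are hypothetical, not taken from the 1,000 Genomes file), and real files should be read with a proper parser such as cyvcf2.

```python
# A made-up VCF data line with two samples (illustrative values only)
LINE = "22\t16050075\t.\tA\tG\t100\tPASS\tAC=1;AF=0.25;AN=4;NS=2;DP=10\tGT:DP\t0|0:4\t0|1:6"


def split_record(line):
    """Split a VCF data line into variant-level and sample-level annotations."""
    fields = line.rstrip("\n").split("\t")
    # Column 8 (INFO) holds variant-level annotations as key=value pairs
    info = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flag entries carry no value
    # Column 9 (FORMAT) names the per-sample fields; columns 10+ hold one entry per sample
    keys = fields[8].split(":")
    samples = [dict(zip(keys, sample.split(":"))) for sample in fields[9:]]
    return info, samples


info, samples = split_record(LINE)
print(info["DP"])                  # variant-level (total) read depth
print([s["DP"] for s in samples])  # per-sample read depth
```

In this toy line, the INFO column carries one DP for the whole variant, while each sample column carries its own DP next to its GT.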

In [22]:
conda install cyvcf2


In [1]:
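# Requires cyvcf2 (installed above)
from cyvcf2 import VCF

v = VCF('genotypes.vcf.gz')
rec = next(v)  # first record in the file
print('Variant Level information')
# Iterating over INFO yields one (key, value) pair per annotation
for info in rec.INFO:
  print(info)
print('Sample Level information')
# FORMAT lists the names of the per-sample fields
for fmt in rec.FORMAT:
  print(fmt)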

OSError: genotypes.vcf.gz is not valid bcf or vcf (format: 15 mode: r)
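
An error like the one above means that cyvcf2 could not read the file as a VCF. One possible cause (an assumption, not something the output confirms) is that the partial download failed and left an empty or truncated `genotypes.vcf.gz`. The following is a minimal sanity check using only the standard library; the helper name `looks_like_vcf_gz` is hypothetical. It relies on the fact that bgzip output is gzip-compatible.

```python
import gzip


def looks_like_vcf_gz(path):
    """Return True if path is gzip/BGZF-compressed and starts with a VCF header line."""
    try:
        with gzip.open(path, "rt", errors="replace") as handle:
            first_line = handle.readline()
    except (OSError, EOFError):
        # Missing file, not gzip data, or a truncated file
        return False
    return first_line.startswith("##fileformat=VCF")
```

Calling `looks_like_vcf_gz('genotypes.vcf.gz')` before opening the file with cyvcf2 gives a quick indication of whether the download step needs to be repeated.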

**Now that we know what information is available, let’s inspect a single VCF record:**

We will start by retrieving the standard information: the chromosome, position, ID, reference base (typically just one), alternative bases (there can be more than one, but as a first filtering approach it is common to accept only a single ALT, for example, only biallelic SNPs), quality (as you might expect, Phred-scaled), and filter status. Regarding the filter status, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe, Studying genome accessibility and filtering SNP data).

We then print the additional variant-level information (AC, AF, NS, AN, DP, and so on), followed by the sample format (in this case, DP and GT). Finally, we count the number of samples and inspect a single sample to check whether it was called for this variant, including its reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common).

In [24]:
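v = VCF('genotypes.vcf.gz')
samples = v.samples
print(len(samples))  # number of samples in the file
variant = next(v)  # first record
# Standard fields of the record
print(variant.CHROM, variant.POS, variant.ID, variant.REF,
      variant.ALT, variant.QUAL, variant.FILTER)
print(variant.INFO)  # prints the INFO object itself, not its contents
print(variant.FORMAT)
print(variant.is_snp)
# Genotype of the first sample
str_alleles = variant.gt_bases[0]  # alleles as bases
alleles = variant.genotypes[0][0:2]  # allele indexes (0 is REF)
is_phased = variant.genotypes[0][2]
print(str_alleles, alleles, is_phased)
print(variant.format('DP')[0])  # read depth of the first sample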

2504
22 16050075 None A ['G'] 100.0 None
<cyvcf2.cyvcf2.INFO object at 0x7f86b9ddff90>
['GT', 'DP']
True
A|A [0, 0] True
1


[W::vcf_parse_info] INFO 'SAS_AF' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EAS_AF' is not defined in the header, assuming Type=String
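
The genotype values printed above (`A|A [0, 0] True`) all derive from the GT string of the sample. The sketch below shows that relationship in plain Python. It is a simplified illustration and not how cyvcf2 works internally; the function name `decode_genotype` is hypothetical.

```python
def decode_genotype(gt, ref, alts):
    """Decode a GT string such as '0|1' into bases, allele indexes, phasing, and zygosity."""
    phased = "|" in gt            # '|' separates phased alleles, '/' unphased ones
    sep = "|" if phased else "/"
    alleles = [ref] + list(alts)  # index 0 is REF, 1.. are the ALT alleles
    indexes = [None if a == "." else int(a) for a in gt.split(sep)]
    bases = ["." if i is None else alleles[i] for i in indexes]
    called = None not in indexes  # '.' marks a missing call
    is_het = called and len(set(indexes)) > 1
    return sep.join(bases), indexes, phased, is_het


print(decode_genotype("0|0", "A", ["G"]))  # ('A|A', [0, 0], True, False)
print(decode_genotype("0/1", "A", ["G"]))  # ('A/G', [0, 1], False, True)
```

The first call reproduces the values reported for the first sample of the record above: homozygous reference and phased.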


**Let’s check the type of variant and the number of nonbiallelic SNPs in a single pass:**

We will use Python's `defaultdict` to count variants by type and subtype. We find that this dataset has INDELs, CNVs, and, of course, SNPs (roughly two-thirds being transitions and one-third transversions). There is a residual number (79) of triallelic SNPs, which matches the count of the ('snp', 'unknown') entry in the output.

In [25]:
from collections import defaultdict
f = VCF('genotypes.vcf.gz')
my_type = defaultdict(int)
num_alts = defaultdict(int)
for variant in f:
  my_type[variant.var_type, variant.var_subtype] += 1
  if variant.var_type == 'snp':
    num_alts[len(variant.ALT)] += 1
print(my_type)

[W::vcf_parse_info] INFO 'SAS_AF' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EAS_AF' is not defined in the header, assuming Type=String


defaultdict(<class 'int'>, {('snp', 'ts'): 10054, ('snp', 'tv'): 5917, ('sv', 'CNV'): 2, ('indel', 'del'): 273, ('snp', 'unknown'): 79, ('indel', 'ins'): 127, ('indel', 'unknown'): 13, ('sv', 'DEL'): 6, ('sv', 'SVA'): 1})
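
The counts above can be summarized further. The following sketch copies the printed dictionary, totals the variants per top-level type, and computes the transition/transversion ratio. It also includes a small helper, `snp_subtype` (a hypothetical name), that applies the textbook definition of a transition; this is an illustration and not necessarily how cyvcf2 assigns `var_subtype`.

```python
# Counts copied from the output above
counts = {
    ('snp', 'ts'): 10054, ('snp', 'tv'): 5917, ('sv', 'CNV'): 2,
    ('indel', 'del'): 273, ('snp', 'unknown'): 79, ('indel', 'ins'): 127,
    ('indel', 'unknown'): 13, ('sv', 'DEL'): 6, ('sv', 'SVA'): 1,
}

PURINES = {'A', 'G'}


def snp_subtype(ref, alt):
    """Classify a biallelic SNP as a transition ('ts') or a transversion ('tv')."""
    # A transition stays within purines (A, G) or within pyrimidines (C, T)
    return 'ts' if (ref in PURINES) == (alt in PURINES) else 'tv'


# Totals per top-level variant type
totals = {}
for (var_type, _subtype), n in counts.items():
    totals[var_type] = totals.get(var_type, 0) + n
print(totals)  # {'snp': 16050, 'sv': 9, 'indel': 413}

ts_tv = counts[('snp', 'ts')] / counts[('snp', 'tv')]
print(round(ts_tv, 2))  # 1.7
```

A ratio of about 1.7 is consistent with the roughly two-thirds to one-third split between transitions and transversions noted earlier.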
