# Whole-Genome Sequence of Mycobacterium ulcerans CSURP7741, a French Guianan Clinical Isolate

[Paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6639603/)

Combined Nanopore and Illumina whole-genome sequencing of a French Guianan Mycobacterium ulcerans (Buruli ulcer agent) clinical isolate yielded a 5.12-Mbp genome with a 65.5% GC content, 5,215 protein-coding genes, and 51 predicted RNA genes. This publicly available M. ulcerans whole-genome sequence from a strain isolated in South America is closely related to M. ulcerans subsp. liflandii.

Reference genome: [Mycobacterium ulcerans Agy99, complete genome](https://www.ncbi.nlm.nih.gov/nuccore/CP000325.1)
- Accession number: CP000325
- Version: CP000325.1

In [2]:
export PATH="$PATH:/Users/julia/tools/sratoolkit.3.0.7-mac64/bin/"
fastq-dump


Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

Use option --help for more information

fastq-dump : 2.8.0



: 1

In [13]:
cd /Users/julia/Github/bioinfo-pipelines

### File prefixes

In [3]:
REF_GENOME="data/ref_genomes/agy99.fasta"
FASTQ_1="P7741_R1"
FASTQ_2="P7741_R2"

## 1. Alignment with reference genome

Burrows-Wheeler Alignment Tool

* [bwa documentation](https://bio-bwa.sourceforge.net/bwa.shtml#3)
* [samtools documentation](http://www.htslib.org/doc/samtools.html)
* [vt](https://genome.sph.umich.edu/wiki/Vt)
* [bcftools](https://samtools.github.io/bcftools/bcftools.html) - [bcftools releases](https://github.com/samtools/bcftools/releases/)

`bwa mem` writes its alignments in SAM format to standard output. `samtools sort` reads that stream, sorts the alignments by coordinate, and writes the result as a **bam** file.
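The cells below align `P7741_R1` and `P7741_R2` independently. If the two files are the mates of a single paired-end run (an assumption, not something the notebook states), `bwa mem` can take both files in one call so that mate information is used. A minimal sketch of that variant follows; the helper name `align_paired` and the output path `results/01_bam/P7741.bam` are hypothetical.

```bash
# Hypothetical helper: align paired-end reads and write a sorted, indexed BAM.
# Assumes bwa and samtools are on PATH and the reference was indexed with `bwa index`.
align_paired() {
    ref=$1; r1=$2; r2=$3; out=$4
    # bwa mem writes SAM to stdout; samtools sort reads it from stdin ("-")
    # and writes a coordinate-sorted BAM to "$out".
    bwa mem "$ref" "$r1" "$r2" | samtools sort -o "$out" - || return 1
    # Index the sorted BAM so downstream tools can access regions directly.
    samtools index "$out"
}

# Run only when the tools and the input reads are actually present.
if command -v bwa >/dev/null 2>&1 && command -v samtools >/dev/null 2>&1 \
    && [ -f data/sequence_reads/P7741_R1.fastq ] \
    && [ -f data/sequence_reads/P7741_R2.fastq ]; then
    align_paired data/ref_genomes/agy99.fasta \
        data/sequence_reads/P7741_R1.fastq \
        data/sequence_reads/P7741_R2.fastq \
        results/01_bam/P7741.bam
else
    echo "bwa/samtools or input reads not found; skipping alignment"
fi
```

With a single BAM, the later `bcftools mpileup` step would report one sample instead of two.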

In [None]:
bwa index $REF_GENOME

In [8]:
bwa mem $REF_GENOME 'data/sequence_reads/'$FASTQ_1'.fastq' | samtools sort -o 'results/01_bam/'$FASTQ_1'.bam'
bwa mem $REF_GENOME 'data/sequence_reads/'$FASTQ_2'.fastq' | samtools sort -o 'results/01_bam/'$FASTQ_2'.bam'

[M::mem_process_seqs] Processed 47008 reads in 25.760 CPU sec, 27.561 real sec
[M::process] read 45338 sequences (10000087 bp)...
[M::mem_process_seqs] Processed 46712 reads in 27.071 CPU sec, 30.060 real sec
[M::process] read 42516 sequences (10000310 bp)...
[M::mem_process_seqs] Processed 45338 reads in 26.885 CPU sec, 27.496 real sec
[M::process] read 42208 sequences (10000159 bp)...
[M::mem_process_seqs] Processed 42516 reads in 34.053 CPU sec, 35.235 real sec
[M::process] read 41998 sequences (10000230 bp)...
[M::mem_process_seqs] Processed 42208 reads in 35.465 CPU sec, 37.226 real sec
[M::process] read 7291 sequences (1755904 bp)...
[M::mem_process_seqs] Processed 41998 reads in 32.988 CPU sec, 37.851 real sec
[M::mem_process_seqs] Processed 7291 reads in 6.111 CPU sec, 6.649 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem data/ref_genomes/agy99.fasta data/sequence_reads/P7741_R1.fastq
[main] Real time: 202.224 sec; CPU: 188.457 sec
[M::bwa_idx_load_from_disk] read 0 A

In [11]:
ls -lt results/01_bam/*bam

-rw-r--r--  1 julia  staff  44717689 10 oct 19:29 results/01_bam/P7741_R2.bam
-rw-r--r--  1 julia  staff  36670331 10 oct 19:25 results/01_bam/P7741_R1.bam


In [10]:
samtools index 'results/01_bam/'$FASTQ_1'.bam'
samtools index 'results/01_bam/'$FASTQ_2'.bam'

In [12]:
ls -lt results/01_bam/*bai

-rw-r--r--  1 julia  staff  16072 10 oct 19:39 results/01_bam/P7741_R2.bam.bai
-rw-r--r--  1 julia  staff  15752 10 oct 19:39 results/01_bam/P7741_R1.bam.bai


## 2. Generate VCF file
### Command options
`bcftools mpileup`:
* `-a` - annotate the VCF with extra tags: here allelic depth (AD), read depth (DP) and the Phred-scaled strand-bias P-value (SP).

`bcftools call`:
* `-f` - comma-separated list of FORMAT fields to output for each sample: here genotype quality (GQ) and genotype posterior probabilities (GP).
* `-m` - use the bcftools multiallelic caller.
* `-v` - output variant sites only (used in the first call below as `-mv`).
* `-O` - output type: `z` for compressed VCF (used for `snps.vcf.gz`), `v` for uncompressed VCF, `u` for uncompressed BCF. Use `-Ou` when piping between bcftools subcommands to avoid unnecessary compression and decompression.

In [32]:
ls -lt results/01_bam/*.bam

-rw-r--r--  1 julia  staff  44717689 10 oct 19:29 results/01_bam/P7741_R2.bam
-rw-r--r--  1 julia  staff  36670331 10 oct 19:25 results/01_bam/P7741_R1.bam


In [15]:
# The glob is left unquoted so the shell expands it to the individual BAM files
bcftools mpileup -Ou -f $REF_GENOME results/01_bam/*.bam \
    | bcftools call -mv -Ov -o 'results/02_vcf/snps.vcf'

Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 2 samples in 2 input files


In [34]:
bcftools mpileup -a AD,DP,SP -Ou -f $REF_GENOME results/01_bam/*.bam \
    | bcftools call -f GQ,GP -mO z -o results/02_vcf/snps.vcf.gz

Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 2 samples in 2 input files
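The note in the output says that `bcftools call` assumed diploid sites because no ploidy option was given. A bacterial chromosome is haploid, so one possible variant of the call, sketched here under that assumption, passes `--ploidy 1`. The helper name `call_haploid` and the output file `snps_haploid.vcf.gz` are hypothetical, and the accepted `--ploidy` values may depend on the installed bcftools release.

```bash
# Hypothetical helper: call variants treating every site as haploid.
# Usage: call_haploid <reference.fasta> <out.vcf.gz> <in1.bam> [<in2.bam> ...]
call_haploid() {
    ref=$1; out=$2; shift 2
    # mpileup streams uncompressed BCF (-Ou) into call;
    # call keeps variant sites only (-v) and writes compressed VCF (-Oz).
    bcftools mpileup -a AD,DP,SP -Ou -f "$ref" "$@" \
        | bcftools call --ploidy 1 -mv -Oz -o "$out"
}

# Run only when bcftools and the reference are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -f data/ref_genomes/agy99.fasta ]; then
    call_haploid data/ref_genomes/agy99.fasta \
        results/02_vcf/snps_haploid.vcf.gz \
        results/01_bam/*.bam
else
    echo "bcftools or reference genome not found; skipping variant calling"
fi
```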


## 3. Clean VCF file

Every record has **CP000325.1** in the **#CHROM** column. This is expected: **Mycobacterium ulcerans Agy99** (a bacterium) has a **single circular chromosome**, so all variants map to the same reference sequence.
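A quick way to check this claim is to tally the distinct values of the first column while skipping all header lines. The sketch below runs on a small toy VCF rather than the real file: the two records reuse positions and alleles from this notebook's output, but the INFO column is trimmed to a bare `INDEL` flag, and the helper name `count_chroms` is hypothetical.

```bash
# Toy VCF standing in for results/02_vcf/snps.vcf (INFO column trimmed for brevity).
toy_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=CP000325.1,length=5631606>\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'CP000325.1\t1627\t.\tCG\tCGG\t359\t.\tINDEL\n'
    printf 'CP000325.1\t12177\t.\tCG\tC\t349\t.\tINDEL\n'
} > "$toy_vcf"

# Hypothetical helper: print each CHROM value with its record count,
# skipping every header line (those starting with "#").
count_chroms() {
    awk -F '\t' '!/^#/ { n[$1]++ } END { for (c in n) print c, n[c] }' "$1" | sort
}

count_chroms "$toy_vcf"
```

Because the `#CHROM` column-header line is skipped too, the output lists only real sequence names.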

In [10]:
bcftools index -f -c 'results/02_vcf/snps.vcf.gz'
ls -lt results/02_vcf/*

-rw-r--r--  1 julia  staff      4086 23 oct 16:25 results/02_vcf/snps.vcf.gz.csi
-rw-r--r--  1 julia  staff  45947933 20 oct 12:26 results/02_vcf/snps.vcf.gz
-rw-r--r--  1 julia  staff   4883405 11 oct 17:20 results/02_vcf/snps.vcf


In [12]:
sed '/^##/d' 'results/02_vcf/snps.vcf' | awk 'length($4) > 1' | awk '{print $1}' | uniq

#CHROM
CP000325.1


In [3]:
# prints only the header lines (those starting with #)
bcftools view -h 'results/02_vcf/snps.vcf.gz'

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.5+htslib-1.5
##bcftoolsCommand=mpileup -a AD,DP,SP -Ou -f data/ref_genomes/agy99.fasta results/01_bam/P7741_R1.bam results/01_bam/P7741_R2.bam
##reference=file://data/ref_genomes/agy99.fasta
##contig=<ID=CP000325.1,length=5631606>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is be

In [None]:
# All of them are region CP000325.1 
bcftools view --regions CP000325.1 'results/02_vcf/snps.vcf.gz' | sed '/^##/d' | head

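# Split multiallelic sites (-m-any), left-align indels against the reference,
# and warn (w) about REF alleles that do not match it.
bcftools norm \
    -m-any \
    --check-ref w \
    -f $REF_GENOME \
    -Oz -o results/02_vcf/snps_norm.vcf.gz \
    results/02_vcf/snps.vcf.gz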

In [2]:
#bcftools view -H results/02_vcf/snps.vcf.gz | head
sed '/^##/d' 'results/02_vcf/snps.vcf' | awk 'length($4) > 1' | awk '{print $2,$3,$4,$5,$6,$7}'

POS ID REF ALT QUAL FILTER
1627 . CG CGG 359 .
5178 . GAA GAAA 486 .
12177 . CG C 349 .
13519 . CGG CG 403 .
14843 . CAT C 12.9655 .
25291 . GCTGGCCGTA GCTGGCCGTACTCGGGCGGCGCCTGACCGTAGTCGGATGGGGCCTGGCCGTA 180 .
25788 . TGGTAGTCGCGCTGGTCGGGGTAGCCGCGCTGGTCTTGGTAGCCGCGCTGGTCGGGGTAGCCGCGCTGGTCGGGGTAGCCGC TGGTAGCCGCGCTGGTCGGGGTAGCCGC 48.8973 .
28382 . CGGG CGG 106 .
33439 . CCGACGAAGCCGCGGACGAAGTCGTTGGCGGGTTCG CCG 411 .
34512 . TCCCCC TCCCCCC 287 .
34881 . TAATG TAATGAATG 469 .
34966 . TC TCCC 473 .
50436 . ACCCCCCC ACCCCCCCCC 98 .
50557 . CGCCTGAGCCGGCGCCGTCGCGGCCGCCGGGTGATGAGGTATCGACACGGGTGCCTGAGC CGCCTGAGC 156 .
51839 . CCCCCTCTGGTGTCCCCTC CCCCCTC 449 .
56049 . CA CAA 486 .
56324 . GCCCCCCCC GCCCCCC 300 .
56808 . CGGGGGGGGGG CGGGGGG 434 .
57243 . GATTATT GATT 471 .
57445 . AACGTGTTGGG AACGTGTTGGGACGTGTTGGG 471 .
60658 . GCC GC 344 .
61074 . CGGGG CGGGGG 143 .
63663 . CGGG CGG 70 .
63666 . GT G 275 .
69902 . ACACCAC ACACCACCAC 10.3157 .
70255 . CGCGGCGGCGG CGCGGCGGCGGCGG 471 .
70733 . CGA

528685 . CAA CA 468 .
528792 . ATT AT 416 .
531587 . GTT GT 267 .
540849 . CGG CG 447 .
548517 . TGG TGGG 286 .
548632 . GACCGGTAAGAACCGGTAAGAACCGGTAAGAA GACCGGTAAGAACCGGTAAGAA 471 .
549846 . TCC TC 316 .
553270 . CTT CT 433 .
563086 . GAT GATGGCCCTGTATTTCGTCAACTACCTGGACCGCACCAACCTCGGCAT 141 .
564958 . ATTTCGACTCGATCGCGTTTCGACTCGATCGCG ATTTCGACTCGATCGCG 483 .
567299 . GCAC GC 17.151 .
568369 . GCCCCC GCCCCCC 91 .
568558 . GCGGCGGCGGCCA GCGGCGGCGGCCATGGGGGCCAGTTCGGCGGCGGCCA 391 .
568762 . CTTT CTTTT 152 .
569202 . CG CGTG 483 .
570665 . ACCC ACCCC 184 .
571797 . ACC ACCC 328 .
574922 . AACGCCAT AACGCCATACGCCAT 405 .
575793 . CCGCCGT CCGCCGTCGGCCCCCGTCCCCAGGGCGCCGT 188 .
579592 . CACCGGAGCCG CACCGGAGCCGACGGAGCGGCCGGCACCGACGGCACTACCGGAGCCG 103 .
580013 . CGCGGCGGGACCGGCGGCGGCGGGACCGGCGGCGGCGGGACCGGCGGCGGCGG CGCGGCGGGACCGGCGGCGGCGG 229 .
589108 . CG CGG 305 .
589192 . TTCTGTTG TTCTGTTGTCTGTTG 331 .
592707 . CGGTCTTGCTGGTCAGGCGGG CGGTCTTGCTGGTCAGGCGGGTCTTGCTGGTCAGGCGGG 247 .
592765 . CTTGCC

1187579 . CGGGGG CGGGGGG 122 .
1189753 . CGCGGCGGCGGC CGCGGCGGC 465 .
1191092 . GGGTGG GGG 481 .
1194023 . ATTTT ATTTTT 134 .
1200004 . CTT CT 388 .
1200752 . CGG CGGG 3.28543 .
1200883 . CCGCGCGCGCG CCGCGCGCG 175 .
1207624 . CCCGCCGTTGCCGCCGT CCCGCCGTTGCCGCCGTTGCCGCCGT 139 .
1207702 . GCCCCC GCCCCCC 38.0542 .
1209947 . CCG CCGGCCCCGTCG 33.0489 .
1210811 . GGA G 4.76678 .
1211347 . CGGG CGG 377 .
1214061 . CTGTTATTGT CTGTTATTGTGTTATTGT 469 .
1214723 . AGC AGCGAGTGCTTGTGCTTGTGC 179 .
1218635 . TCCCCCCC TCCCCC 349 .
1221390 . ATGC A 138 .
1221391 . TGCGGCGGCGGCGGC TGCGGCGGCGGC 475 .
1222235 . AACGGCGGCG A 130 .
1225913 . TGAGTGAGAATCTAGTGAGAATCGA TTAGTGAGAATCGA 484 .
1225914 . GAGTGAGAATCTAGTGAGAATC GAGTGAGAATC 486 .
1234305 . ATCCCGGTGGGTC ATCCCGGTGGGTCCCGGTGGGTC 55 .
1237923 . CGGCAATGGCGGCGGTGCCGGTGCCGGCGGCGCGGGGGGCAATGGCGGCGATGCCGGTGCCGGCG CGGCAATGGCGGCGATGCCGGTGCCGGCG 20.65 .
1238063 . GCAACGGCGGCAACGGCG GCAACGGCGGCAACGGCGGCAACGGCGTCAACGGCGGCAACGGCG 287 .
1238714 . CGGGGGGGGGGGG CGG

1826057 . AG AGG 486 .
1828658 . GCGGTCTTTGCGGCCAATAGATCGGTCTTTGCGGCCAATAGATCGGT GCGGTCTTTGCGGCCAATAGATCGGT 414 .
1831261 . AGGGG AGGGGG 195 .
1833839 . GTT GT 415 .
1840896 . ATTCCGCCGTT AT 476 .
1840898 . TCCGCCGTTGCCGCCGTTGCCG TCCGCCGTTGCCG 470 .
1840911 . CCGTTGCCGA C 486 .
1841511 . GAAAAAAAA GAAAAAAA 80 .
1841723 . GTTTTTTT GTTTTTTTT 187 .
1850328 . CTTT CTT 66 .
1850504 . CAAAA CAAA 85 .
1864944 . GCCCC GCCC 186 .
1868692 . AT ATT 361 .
1873621 . GT G 246 .
1877148 . CAAA CAAAA 190 .
1877237 . AT ATT 21.8049 .
1877238 . TATGCCCAACTAA T 371 .
1877250 . AT ATT 23.0359 .
1884130 . ATTT ATTTT 4.79533 .
1885667 . GGACCAGCAGCGTGACCAGCAGC GGACCAGCAGC 471 .
1885668 . GACCAGCAGCGT G 471 .
1888914 . GAT G 483 .
1892395 . GCCGCCGTTGGCC G 288 .
1892408 . GCCCCCCC GCCCC 36.2494 .
1892691 . TGGG TGG 240 .
1893137 . CCCGCCGGCGCCGCCG CCCGCCG 33.0489 .
1894135 . ATGCCGCCACTGCCGCCGGTGCCAGGGGTGCCGCCGGTGCCGCC ATGCCGCC 5.70844 .
1894144 . CTGCCGCCGGTGCC CTGCCGCCGGTGCCGCCGGTGCC 280 .
1894253 . TGTTGC

2614601 . CG CGG 483 .
2614903 . CCGTCGTC CCGTC 484 .
2618339 . CGG CG 384 .
2623363 . GCCACCACCACCACCACC GCCACCACCACCACC 153 .
2624792 . ACCCC ACCCCC 264 .
2625510 . CA CAA 322 .
2630120 . GC GCC 239 .
2632263 . CGCTACCCCGGCTACC CGCTACCCCGGCTACCCCGGCTACC 83 .
2633888 . CCGGATCCGGATC CC 338 .
2633900 . CAA CAAA 41.6318 .
2634122 . CTTT CTTTT 439 .
2634771 . GC GCC 411 .
2635682 . GCCCCCC GCCCCC 137 .
2641720 . TGCCGCCGCCGCCG TGCCGCCGCCG 143 .
2642036 . TCCCCC TCCCCCC 243 .
2644019 . CAAAAA CAAAA 108 .
2644386 . TCATCATCGGCTCCAGCAGCTCCATCATC TCATCATC 441 .
2655090 . TGGG TG 483 .
2655454 . CGTTCCTTAAATTGTTCCTTAAATT CGTTCCTTAAATT 465 .
2659444 . TACCCGACCCGA TACCCGA 11.0069 .
2666872 . TGGGGGGGGGGGGGGGGGGG TGGGGGGG 472 .
2672331 . GCCCCCC GCCCCCCCCC 73 .
2672889 . ATTATTT ATT 10.2828 .
2672892 . ATTT ATT 406 .
2674204 . GC GCC 286 .
2676730 . GC G 414 .
2680110 . AG A 428 .
2684033 . TGGGGG TGGGGGG 217 .
2686380 . GAAAAAA GAAAAAAA 80 .
2698555 . CCCG CCCGGCAGCCGCCGCGACTCGCCG 424 .
269863

3427634 . ATAATCCATCGCCATTTTAATCCATCGCCATTT ATAATCCATCGCCATTT 480 .
3429538 . GCCCCC GCCCCCC 142 .
3429665 . CGG CGGG 322 .
3429791 . AATTATTGCATAGGGAACA AA 473 .
3433471 . TCCTCGGCCTCGGCCTCGGCCTCGGCCTC TCCTCGGCCTCGGCCTCGGCCTC 476 .
3436442 . CAA C 10.6395 .
3437466 . ACCCCC ACCCCCC 243 .
3438475 . CGGGGGG CGGGGGGG 119 .
3440037 . CG CGG 158 .
3443172 . ACC AC 340 .
3443483 . CTTTTT CTTTT 254 .
3444604 . CGCCGCCGGCACCACCGGAGCCGCCGGCACCACCGGA CGCCGCCGGCACCACCGGA 134 .
3444697 . GTTT GTTTT 234 .
3445225 . AGCCGTCGCCGCCGTCGCC AGCCGTCGCC 33.0489 .
3445405 . CAA CA 69 .
3445504 . TGCCGTCAGCCCCGCCGGCGCCG TGCCG 268 .
3445706 . ACCGGCACCGCCA ACCGGCACCGCCAGCGCCTCCGGCACCGCCA 131 .
3451724 . CCAC CC 41.0781 .
3451726 . ACCCC ACC 486 .
3454225 . TC TCC 331 .
3457776 . TGGGGG TGGGG 199 .
3460198 . GCGC GCACGC 471 .
3464453 . GGC GGCGC 483 .
3464807 . ATCTC ATCTCTC 486 .
3465547 . GCCGGTGGCACCGGTGGCACCGGTGG GCCGGTGGCACCGGTGG 153 .
3470576 . AT A 359 .
3471149 . AGGGGGG AGGGGG 95 .
3472191 . ACCC ACC

3983606 . TGAGG TGAGGAGG 486 .
3984827 . GC GCC 241 .
3984950 . GCC GC 430 .
3991418 . GCCCCCC GCCCCC 226 .
3993016 . GGCGC GGC 411 .
4001140 . ACTGGCCGTGAACTCTGGCCGTGAACT ACTGGCCGTGAACT 261 .
4018546 . GCCC GCC 307 .
4027309 . GCCCCC GCCCCCCCCC 388 .
4029263 . GA G 405 .
4055023 . GC GCC 419 .
4057085 . GAAAAA GAAAA 186 .
4059473 . ACCCCC ACCCCCC 49.8265 .
4060107 . GCCCAAATCCAGCTACTTTAATA GCCCAAATCCAGCTACTTTAATACCCAAATCCAGCTACTTTAATA 290 .
4060114 . TCCAGCTACTTTAATA TCCAGCTACTTTAATACCCAAACCCAGCTACTTTAATA 284 .
4062734 . GCCCCC GCCCCCC 157 .
4065720 . CGGG CGG 486 .
4066145 . CCGCGC CCGC 486 .
4066366 . CG CGG 382 .
4067809 . TTCGATGGCGCACCAGCAGAACACAGCTCGATGGCGCACCAGCAGAACACAGCTCG TTCGATGGCGCACCAGCAGAACACAGCTCG 451 .
4070130 . CT C 201 .
4075865 . CG CGG 362 .
4079673 . TCCCAGCAGGCCGATACCCAGCAGG TCCCAGCAGG 474 .
4079683 . CCGATACCCAGCAGGGC CC 486 .
4085039 . TCGGCGG TCGGCGGTGCCGGCGACGGCACCGGCGACGGCGGCGACGGCGG 107 .
4086509 . GACGGGGGTGATGGCGGGGCCGGCGGGCACGGGGGTGATGGCGGGGCCGGCGGG GACG

4766840 . CCGGCGGC CCGGCGGCGACGGCGGC 304 .
4770735 . GTT GT 206 .
4774406 . GCC GC 407 .
4789784 . GGCCGGGGTTCGGGCGCCCGCCGGGGTTCGGGCGCC GGCCGGGGTTCGGGCGCC 446 .
4789960 . CGC CGCGGCCGGCGCTTCCGGGGC 479 .
4798499 . CG CGG 483 .
4803193 . GAAA GAA 384 .
4805743 . CT C 452 .
4807144 . ACC AC 295 .
4807372 . GGT GGTGT 355 .
4809456 . ACCAGCACGATGCCCAATGTCATCCGGGCGATGGCCAGC ACCAGC 355 .
4811494 . CGCCGGC CGCCGGCCGGC 467 .
4811713 . GCC GCCC 353 .
4817875 . GCCCCC GCCCC 119 .
4817886 . CA C 207 .
4818859 . GCCCC GCCC 54 .
4820367 . GACAAC GAC 482 .
4823727 . TGGG TGG 255 .
4828326 . GAGCAGGTGGGCATGCAGCCTCGAGGTCAG GAG 469 .
4832346 . CCGCG CCG 483 .
4838880 . CA CAA 467 .
4840091 . CA C 345 .
4846351 . GCCGGCCGCACCACCGG GCCGGCCGCACCACCGGAACCGGCCGCACCACCGG 296 .
4846356 . CCGCACCACCGG CCGCACCACCGGAACCGGTCGCACCACCGG 291 .
4846605 . CGG CGGG 27.2135 .
4847710 . GCCATCACCACC GCC 107 .
4855534 . CAGACCTACCC CAGACCTACCCAAGACCTACCC 266 .
4857822 . CCAAGATCAAGAT CCAAGAT 420 .
4859368 . GCCCCCCC GCCCCC

5452773 . ACCGGCGCCGGCGCCGGC ACCGGCGCCGGCGCCGGCGCCGGC 484 .
5474181 . CA CAA 305 .
5478116 . TGGG TGGGG 357 .
5478319 . ACCCCCCC ACCCCCC 101 .
5478506 . TCGCCGCCGCC TCGCCGCC 474 .
5480477 . GGCCGCC GGCC 16.0843 .
5482796 . TGGGG TGGGGG 330 .
5482974 . TCCCC TCCCCC 174 .
5489917 . GCGCCGCGGCGGGGACGGTACGCCGCGGCG GCGCCGCGGCG 319 .
5495526 . TGG TG 483 .
5497880 . CAA CA 352 .
5498282 . GGTGCTCGTCGTGCTCGTCG GGTGCTCGTCG 397 .
5498525 . GA GAA 486 .
5506312 . TGCCGCTGCTGGCGCCGCTGCT TGCCGCTGCT 482 .
5508259 . ACGGCGCCGG ACGGCGCCGGCGCCGG 398 .
5518912 . CG C 278 .
5519461 . GTT GT 421 .
5521980 . CACGGCGGCAACGCCGGCAACGGCG CACGGCGGCAACGCCGGCAACGGCGGCAACGCCGGCAACGGCG 10.5115 .
5522251 . CCGGCGGCGGCGGC CCGGCGGCGGC 134 .
5523576 . CAGGGCGGTGAGGG CAGGGCGGTGAGGGCGGTGAGGG 339 .
5523668 . GGGCGGCCTCGGCGGC GGGCGGC 261 .
5523675 . CTCGGCGGCG C 308 .
5523733 . GCCCC GCC 233 .
5536666 . ATTTT ATTTTT 298 .
5536818 . TATTTACCAAACCGAATATTGCATTTACCAAACCGAATATTGC TATTTACCAAACCGAATATTGC 268 .
5537993 . GCATACGG
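
The `length($4) > 1` filter above selects records whose REF allele is longer than one base. Another way to pick out indels, sketched here, relies on the `INDEL` flag that the VCF header declares for the INFO column, and then labels each record by comparing the REF and ALT lengths. The toy file reuses two records from the output above with a trimmed INFO column and adds one made-up SNP line for contrast. The helper name `classify_indels` is hypothetical, and the sketch assumes biallelic records (no comma-separated ALT).

```bash
# Toy VCF: two indels from the notebook output plus one made-up SNP record.
indel_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'CP000325.1\t1627\t.\tCG\tCGG\t359\t.\tINDEL\n'
    printf 'CP000325.1\t2000\t.\tA\tG\t100\t.\tDP=30\n'
    printf 'CP000325.1\t12177\t.\tCG\tC\t349\t.\tINDEL\n'
} > "$indel_vcf"

# Hypothetical helper: keep records flagged INDEL in the INFO column (field 8)
# and label them as insertion or deletion from the allele lengths.
classify_indels() {
    awk -F '\t' '
        /^#/ { next }                        # skip header lines
        $8 !~ /(^|;)INDEL(;|$)/ { next }     # skip records without the INDEL flag
        {
            kind = (length($5) > length($4)) ? "insertion" : "deletion"
            print $2, $4, $5, kind
        }' "$1"
}

classify_indels "$indel_vcf"
```

On the real file, `bcftools view -v indels` is the built-in way to restrict the output to indel records.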

In [6]:
bcftools norm -m +both results/02_vcf/snps.vcf.gz | awk '$2 == "1627" {print $2,$3,$4,$5,$6,$7}'

1627 . CG CGG 381 .
Lines   total/split/realigned/skipped:	5111345/0/0/0


In [2]:
bcftools norm -a . -f $REF_GENOME results/02_vcf/snps.vcf.gz

bcftools: invalid option -- a

About:   Left-align and normalize indels; check if REF alleles match the reference;
         split multiallelic sites into multiple rows; recover multiallelics from
         multiple rows.
Usage:   bcftools norm [options] <in.vcf.gz>

Options:
    -c, --check-ref <e|w|x|s>         check REF alleles and exit (e), warn (w), exclude (x), or set (s) bad sites [e]
    -D, --remove-duplicates           remove duplicate lines of the same type.
    -d, --rm-dup <type>               remove duplicate snps|indels|both|any
    -f, --fasta-ref <file>            reference sequence (MANDATORY)
    -m, --multiallelics <-|+>[type]   split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both]
        --no-version                  do not append version and command line to the header
    -N, --do-not-normalize            do not normalize indels (with -m or -c s)
    -o, --output <file>               write output to a file [standard output]
    -O, --out

: 1