### 22 October 2019
# Download the latest ClinVar GRCh37 VCF clinical variants file and verify its integrity
### by Pavlos Bousounis
***last updated 11/07/2019***

1. Download the GRCh37 ClinVar variants file
2. Download the GRCh37 ClinVar variants checksum file
3. Compare the MD5 checksum calculated from the downloaded file with the value in the checksum file to verify download integrity

#### Input file metadata:
* SOURCE FTP DIRECTORY: ***ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/***
* FILE: ***clinvar.vcf.gz***
* LAST MODIFIED: ***4/24/18, 8:00:00 PM***
* ASSEMBLY: ***GRCh37***
* TYPE: 
* KNOWN REFSEQS FROM: 
* MODEL REFSEQS FROM: 
* FILE ACCESSED ON: ***10 Oct, 2019***

### Import modules

In [2]:
from datetime import datetime
from ftplib import FTP
import hashlib
import os
import pathlib
import shutil

In [3]:
today = datetime.today().strftime('%Y-%m-%d')

print('Current working directory: {}\n'.format(os.getcwd()))
print('Today is: {}'.format(today))

Current working directory: /Users/pbousounis/Experiments/2019-10-29_hg19mod/ClinVarVCF-GRCh37_download_validate

Today is: 2019-11-07


### Set working directory

In [9]:
basedir = '/Users/pbousounis/Experiments/2019-10-29_hg19mod/ClinVarVCF-GRCh37_download_validate'
os.chdir(basedir)

### Define function ***fetch_ncbi()***

In [4]:
def fetch_ncbi(server, user, passwd, path, filename):
    """ Given an FTP server address, user name, password, FTP directory path, and filename,
        download the file from the specified FTP directory into a local directory
        named after the file, and report whether the file was saved. """
    
    # connect to the FTP server and log in
    ftp = FTP(server)
    ftp.login(user = user, passwd = passwd)
    
    # specify FTP directory
    ftp.cwd(path)
    
    # create an output directory named after the file (no error if it already exists)
    os.makedirs(filename, exist_ok = True)
    
    # path of the local file that will receive the remote file contents
    file_path_out = os.path.join(filename, filename)
    
    # retrieve binary data from server and write to local file
    # blocksize: data is transferred in chunks of 1024 bytes
    # the with-block closes the local file even if the transfer fails
    with open(file_path_out, 'wb') as localfile:
        ftp.retrbinary('RETR ' + filename, localfile.write, 1024)
    ftp.quit()
    
    if os.path.isfile(file_path_out):
        print('Success! {} saved to {}'.format(filename, file_path_out))
    else:
        print('ERROR: file not downloaded.')
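
The download step above can also be sketched with context managers, so that both the FTP connection and the local file are closed even when the transfer raises an error. This is only an illustrative variant, not the function used in this notebook: the names `download_binary` and `fetch_with_context` and the `out_dir` parameter are hypothetical.

```python
from ftplib import FTP
import os


def download_binary(ftp, filename, out_dir, blocksize=1024):
    """Write `filename` from an already-connected FTP object into `out_dir`.

    `ftp` only needs a retrbinary(cmd, callback, blocksize) method,
    so a small stub object can stand in for a real connection.
    """
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, filename)
    # the with-block closes the local file even if the transfer fails
    with open(out_path, 'wb') as fh:
        ftp.retrbinary('RETR ' + filename, fh.write, blocksize)
    return out_path


def fetch_with_context(server, user, passwd, path, filename, out_dir):
    """Hypothetical variant of fetch_ncbi(); FTP objects support `with`."""
    with FTP(server) as ftp:
        ftp.login(user=user, passwd=passwd)
        ftp.cwd(path)
        return download_binary(ftp, filename, out_dir)
```

Separating the transfer from the connection setup is a design choice made here so the file-writing logic can be exercised without network access.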

### Define function ***check_md5()***

In [5]:
def check_md5(file, checksum_file):
    
    """ Compare the md5 checksum calculated from a downloaded file ('file') with the
        reference value stored in its associated md5 file ('checksum_file') """
    
    # Read the reference checksum: the first whitespace-delimited field of the checksum file
    with open(checksum_file, 'r') as checksumFile:
        ref_md5 = checksumFile.read().split()[0]

    # Calculate the md5sum of the downloaded VCF
    md5_hash = hashlib.md5()

    with open(file, 'rb') as f:

        # Read the file and update the hash in chunks of 4096 bytes
        for byte_block in iter(lambda: f.read(4096), b""):
            md5_hash.update(byte_block)

    vcf_md5 = md5_hash.hexdigest()

    # Reference md5sum
    print(f'Reference md5sum - {ref_md5}')
    # VCF calculated md5sum
    print(f'VCF md5sum ------- {vcf_md5}\n')

    if ref_md5 == vcf_md5:
        print("MATCH")
    else:
        print("ERROR: NO MATCH")
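
For use outside a notebook, the same comparison can be sketched as functions that return values instead of printing them. This is a minimal sketch, assuming the checksum file stores the hash as its first whitespace-delimited field (as the code above also assumes); the function names `read_reference_md5`, `file_md5`, and `md5_matches` are hypothetical.

```python
import hashlib


def read_reference_md5(checksum_file):
    """Return the first whitespace-delimited field of the checksum file."""
    with open(checksum_file, 'r') as fh:
        return fh.read().split()[0].lower()


def file_md5(path, chunk_size=4096):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    md5 = hashlib.md5()
    with open(path, 'rb') as fh:
        # iter() with a sentinel stops when read() returns empty bytes
        for block in iter(lambda: fh.read(chunk_size), b''):
            md5.update(block)
    return md5.hexdigest()


def md5_matches(path, checksum_file):
    """True if the calculated digest equals the reference digest."""
    return file_md5(path) == read_reference_md5(checksum_file)
```

Returning a boolean lets a calling script decide what to do on a mismatch, for example stop before any downstream analysis reads the file.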

### Download the latest ClinVar GRCh37 VCF clinical variants file from the NCBI FTP server to a new directory

In [9]:
# fetch the ClinVar VCF file
fetch_ncbi('ftp.ncbi.nlm.nih.gov',
           'anonymous', 'pbousounis@childrensnational.org',
           '/pub/clinvar/vcf_GRCh37/',
           'clinvar.vcf.gz')

Success! clinvar.vcf.gz saved to clinvar.vcf.gz/clinvar.vcf.gz


### Fetch the associated checksum file if present

In [10]:
try:
    fetch_ncbi('ftp.ncbi.nlm.nih.gov',
               'anonymous', 'pbousounis@childrensnational.org',
               '/pub/clinvar/vcf_GRCh37/',
               'clinvar.vcf.gz.md5')
except Exception:
    print("WARNING: checksum file could not be located/downloaded.")
    # remove the leftover output directory; do not fail if it was never created
    shutil.rmtree('clinvar.vcf.gz.md5', ignore_errors=True)
finally:
    print("Finished.")

Success! clinvar.vcf.gz.md5 saved to clinvar.vcf.gz.md5/clinvar.vcf.gz.md5
Finished.
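
The optional-download pattern above can be narrowed so that only FTP and network errors are treated as "checksum file not available", while unrelated bugs still surface. The sketch below uses `ftplib.all_errors`, the tuple of exceptions that `ftplib` operations can raise; the wrapper name `try_fetch` and its `cleanup_dir` parameter are hypothetical.

```python
import ftplib
import shutil


def try_fetch(fetch, *args, cleanup_dir=None):
    """Call fetch(*args); on an FTP or network error, warn and clean up.

    `fetch` is any download callable (for example fetch_ncbi).
    Returns True on success and False on failure.
    """
    try:
        fetch(*args)
    except ftplib.all_errors as err:
        print(f'WARNING: checksum file could not be downloaded ({err})')
        if cleanup_dir is not None:
            # ignore_errors avoids a second failure if the directory is absent
            shutil.rmtree(cleanup_dir, ignore_errors=True)
        return False
    return True
```

With this wrapper, the checksum download would be called as `try_fetch(fetch_ncbi, server, user, passwd, path, filename, cleanup_dir=filename)`, and the verification step could be skipped when it returns `False`.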


In [13]:
os.getcwd()

'/Users/pbousounis/Experiments/2019-10-29_hg19mod/ClinVarVCF-GRCh37_download_validate'

### Verify download integrity

In [14]:
file = 'clinvar.vcf.gz/clinvar.vcf.gz'
checksum_file = 'clinvar.vcf.gz.md5/clinvar.vcf.gz.md5'
check_md5(file, checksum_file)

Reference md5sum - 9448df54be6b6700ab10b81d968fe5ce
VCF md5sum ------- 9448df54be6b6700ab10b81d968fe5ce

MATCH


## Results:

1. ***clinvar.vcf.gz*** was downloaded from ***ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37*** to ***/Users/pbousounis/Experiments/2019-10-29_hg19mod/ClinVarVCF-GRCh37_download_validate/clinvar.vcf.gz/***

2. ***clinvar.vcf.gz.md5*** was downloaded from ***ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37*** to ***/Users/pbousounis/Experiments/2019-10-29_hg19mod/ClinVarVCF-GRCh37_download_validate/clinvar.vcf.gz.md5/***

3. The md5sum calculated for ***clinvar.vcf.gz*** matches the reference value in ***clinvar.vcf.gz.md5***: MATCH = OK