### 22 October 2019
# Download the latest RefSeq GRCh37 GFF3 genomic annotation file and verify its integrity
### by Pavlos Bousounis
***last updated 11/07/2019***

* SOURCE FTP DIRECTORY: ***ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers***
* FILE: ***GRCh37_latest_genomic.gff.gz***
* LAST MODIFIED: ***4/24/18, 8:00:00 PM***
* ASSEMBLY: ***GRCh37.p13***
* TYPE: ***interim (current known RefSeqs)***
* KNOWN REFSEQS FROM: ***1-13-2017***
* MODEL REFSEQS FROM: ***none included***

* FILE ACCESSED ON: ***7 November, 2019***

### Import modules

In [30]:
from datetime import datetime
from ftplib import FTP
import hashlib
import os
import pathlib
import shutil

In [31]:
today = datetime.today().strftime('%Y-%m-%d')

print('Current working directory: {}\n'.format(os.getcwd()))
print('Today is: {}'.format(today))

Current working directory: /Users/pbousounis/Experiments/2019-10-29_hg19mod/RefSeqGFF3_GRCh37-download_validate

Today is: 2019-11-07


### Set working directory

In [9]:
basedir = '/Users/pbousounis/Experiments/2019-10-29_hg19mod/RefSeqGFF3_GRCh37-download_validate'
os.chdir(basedir)

### Define function ***fetch_refseq()***

In [12]:
def fetch_refseq(server, user, passwd, path, filename):

    """ Given an FTP server address, user name, password, FTP directory path, and filename:
        * Download the file from the FTP directory into a local directory named after the file
        The associated md5 checksum file is fetched with a second call to this function,
        and the two checksums are compared with check_md5() """
    
    # specify domain name
    ftp = FTP(server)
    ftp.login(user = user, passwd = passwd)
    
    # specify FTP directory
    ftp.cwd(path)
    
    # create/specify output directory
    os.makedirs(filename, exist_ok = True)
    
    # prepare local file to be written according to remote file contents
    file_path_out = os.path.join(filename, filename)
    
    # retrieve binary data from server and write to local file
    # blocksize: at most 1024 bytes are transferred per chunk
    with open(file_path_out, 'wb') as localfile:
        ftp.retrbinary('RETR ' + filename, localfile.write, 1024)
    ftp.quit()
    
    if os.path.isfile(file_path_out):
        print('Success! {} saved to {}'.format(filename, file_path_out))
    else:
        print('ERROR: file not downloaded.')

### Define function ***check_md5()***

In [13]:
def check_md5(file, checksum_file):
    
    """ Given a newly downloaded database file:
        * Compare md5 checksums of downloaded file ('file') and associated md5 file ('checksum_file') """
    
    # Read the reference md5sum (first whitespace-separated token) from the checksum file
    with open(checksum_file, 'r') as checksumFile:
        md5Checksum = checksumFile.read()
        ref_md5 = md5Checksum.split()[0]

    # Calculate the md5sum of the downloaded file
    md5_hash = hashlib.md5()

    with open(file,'rb') as f:

        # Read and update hash in chunks of 4K
        for byte_block in iter(lambda: f.read(4096),b""):
            md5_hash.update(byte_block)

        file_md5 = md5_hash.hexdigest()

    # Reference md5sum
    print(f'Reference md5sum - {ref_md5}')
    # Calculated md5sum of the downloaded file
    print(f'File md5sum ------ {file_md5}\n')

    if ref_md5 == file_md5:
        print("MATCH")
    else:
        print("ERROR: NO MATCH")
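
A variant that returns the comparison result instead of printing it may be easier to reuse in scripts. The following is a minimal sketch, not part of the original notebook: the helper names `md5_of_file()` and `md5_matches()` are hypothetical, and it assumes the checksum file follows the common `md5sum` layout, in which the hex digest is the first whitespace-separated token.

```python
import hashlib

def md5_of_file(path, block_size=4096):
    """Return the hex md5 digest of the file at 'path', read in chunks."""
    md5_hash = hashlib.md5()
    with open(path, 'rb') as f:
        # Read and update the hash one block at a time to keep memory use low
        for byte_block in iter(lambda: f.read(block_size), b""):
            md5_hash.update(byte_block)
    return md5_hash.hexdigest()

def md5_matches(path, checksum_file):
    """Return True if the md5 of 'path' equals the first token in 'checksum_file'."""
    # Assumption: the digest is the first whitespace-separated token
    with open(checksum_file, 'r') as fh:
        ref_md5 = fh.read().split()[0]
    return md5_of_file(path) == ref_md5.lower()
```

Returning a boolean lets the caller decide how to react to a mismatch (for example, by deleting the download and retrying).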

### Download the latest RefSeq GRCh37 GFF3 genomic annotation file from the NCBI FTP server to a new directory

In [20]:
# fetch the annotation file
fetch_refseq('ftp.ncbi.nlm.nih.gov',
            'anonymous', 'pbousounis@childrensnational.org',
            '/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/', 
             'GRCh37_latest_genomic.gff.gz')

Success! GRCh37_latest_genomic.gff.gz saved to GRCh37_latest_genomic.gff.gz/GRCh37_latest_genomic.gff.gz
SUCCESS!


### Fetch the associated checksum file if present

In [25]:
try:
    fetch_refseq('ftp.ncbi.nlm.nih.gov',
                 'anonymous', 'pbousounis@childrensnational.org',
                 '/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/', 
                 'GRCh37_latest_genomic.gff.gz.md5')
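except Exception:
    print("WARNING: checksum file could not be located/downloaded.")
    # remove the (possibly empty) output directory; ignore_errors covers the case where it was never created
    shutil.rmtree('GRCh37_latest_genomic.gff.gz.md5', ignore_errors = True)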
finally:
    print("Finished.")

Finished.
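
An alternative to attempting the download and catching the failure is to list the remote directory first. The sketch below is an assumption-laden illustration rather than the notebook's method: `remote_file_exists()` is a hypothetical helper, and it relies only on `ftplib.FTP.cwd()` and `ftplib.FTP.nlst()`. Some servers return bare file names from `nlst()` and others return paths, so only the last path component is compared.

```python
def remote_file_exists(ftp, path, filename):
    """Return True if 'filename' appears in the listing of FTP directory 'path'.

    'ftp' is a connected, logged-in ftplib.FTP object (or any object
    offering compatible cwd() and nlst() methods).
    """
    ftp.cwd(path)
    # Keep only the final path component of each listed entry
    names = [name.rsplit('/', 1)[-1] for name in ftp.nlst()]
    return filename in names

# Example use (requires network access; not executed here):
#   from ftplib import FTP
#   ftp = FTP('ftp.ncbi.nlm.nih.gov')
#   ftp.login()   # anonymous login
#   remote_file_exists(ftp,
#                      '/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/',
#                      'GRCh37_latest_genomic.gff.gz.md5')
#   ftp.quit()
```

Checking the listing first avoids creating an empty local directory that then has to be removed.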


## Results:

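1. ***GRCh37_latest_genomic.gff.gz*** was downloaded from ***ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/*** to the ***GRCh37_latest_genomic.gff.gz/*** subdirectory of ***/Users/pbousounis/Experiments/2019-10-29_hg19mod/RefSeqGFF3_GRCh37-download_validate***

2. No associated checksum file was found on the server, so ***check_md5()*** was not run and the download could not be verified against a reference md5sum.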
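
When no reference checksum is available, a weaker check is still possible because the download is a gzip file: the gzip format stores a CRC-32 and length of the uncompressed data, which Python's `gzip` module verifies when the stream is read to the end. The sketch below is a suggestion, not something the notebook ran; `gzip_is_intact()` is a hypothetical helper. It can detect truncation or corruption in transit, but it cannot confirm that the contents match what NCBI published.

```python
import gzip
import zlib

def gzip_is_intact(path, block_size=1024 * 1024):
    """Return True if the gzip file at 'path' decompresses to the end without error."""
    try:
        with gzip.open(path, 'rb') as f:
            # Read and discard the decompressed data block by block;
            # the trailer (CRC-32 and length) is checked at end of stream
            while f.read(block_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError: not a gzip file or CRC/length mismatch
        # EOFError: file ended before the end-of-stream marker
        # zlib.error: corrupted compressed data
        return False
```

It could be called as `gzip_is_intact('GRCh37_latest_genomic.gff.gz/GRCh37_latest_genomic.gff.gz')`, using the path reported by `fetch_refseq()`.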