### 4 November 2019
# Download the latest Ensembl GRCh37.87 GFF3 genomic annotation file and verify its integrity
### by Pavlos Bousounis
***last updated 11/08/2019***

* SOURCE FTP DIRECTORY: ***ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/***
* FILE: ***Homo_sapiens.GRCh37.87.chr.gff3.gz***
* LAST MODIFIED: ***3/20/17, 12:00:00 AM***
* ASSEMBLY: ***GRCh37.87***


* FILE ACCESSED ON: ***4 Nov, 2019***

### Import modules

In [52]:
from datetime import datetime
from ftplib import FTP
import hashlib
import os
import pandas as pd
import pathlib
import re
import shutil
import subprocess

In [2]:
today = datetime.today().strftime('%Y-%m-%d')

print('Current working directory: {}\n'.format(os.getcwd()))
print('Today is: {}'.format(today))

Current working directory: /Users/pbousounis/Experiments/2019-10-29_hg19mod/2019-11-07_EnsemblGFF3_GRCh37-download_verify

Today is: 2019-11-08


### Set working directory

In [3]:
basedir = '/Users/pbousounis/Experiments/2019-10-29_hg19mod/2019-11-07_EnsemblGFF3_GRCh37-download_verify'
os.chdir(basedir)

### Define function ***fetch_ensembl()***

In [6]:
def fetch_ensembl(server, user, passwd, path, filename):
    """ Given an FTP server address, user name, password, FTP directory, and filename,
    download the file from the FTP directory and save it to ./<filename>/<filename>.
    The checksum file is fetched with a separate call to this function, and the
    checksums are compared afterwards with unix_checksum(). """

    # connect and log in to the FTP server
    ftp = FTP(server)
    ftp.login(user = user, passwd = passwd)

    # change to the FTP directory
    ftp.cwd(path)

    # create the output directory (named after the file)
    os.makedirs(filename, exist_ok = True)
    file_path_out = os.path.join(filename, filename)

    # retrieve the remote file in binary mode and write it to the local file
    # 1024 is the block size, in bytes, of each chunk read from the server
    with open(file_path_out, 'wb') as localfile:
        ftp.retrbinary('RETR ' + filename, localfile.write, 1024)
    ftp.quit()

    if os.path.isfile(file_path_out):
        print('Success! {} saved to {}'.format(filename, file_path_out))
    else:
        print('ERROR: file not downloaded.')
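
The download step could also be written with context managers, so that the FTP connection and the local file are both closed even if the transfer fails partway. This is a minimal sketch, not the function used in this notebook: ***download_with_cleanup()*** and its ***connect*** parameter are hypothetical names. ***connect*** defaults to ***ftplib.FTP*** and exists only so the function can be exercised without a network connection.

```python
import os
from ftplib import FTP


def download_with_cleanup(server, user, passwd, path, filename, connect=FTP):
    """Hypothetical variant of fetch_ensembl() that uses context managers."""
    # create the output directory (named after the file, as in fetch_ensembl)
    os.makedirs(filename, exist_ok=True)
    file_path_out = os.path.join(filename, filename)

    # ftplib.FTP supports the `with` statement: the connection is closed on exit
    with connect(server) as ftp:
        ftp.login(user=user, passwd=passwd)
        ftp.cwd(path)
        # the local file is closed on exit, even if retrbinary raises
        with open(file_path_out, 'wb') as localfile:
            ftp.retrbinary('RETR ' + filename, localfile.write, 1024)

    return file_path_out
```

Returning the output path instead of printing it lets the caller pass it straight to the checksum step.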

### Download the latest Ensembl GRCh37.87 GFF3 genomic annotation file from the Ensembl FTP server to a new directory

In [7]:
# fetch the annotation file
fetch_ensembl('ftp.ensembl.org',
            'anonymous', 'pbousounis@childrensnational.org',
            '/pub/grch37/current/gff3/homo_sapiens/', 
             'Homo_sapiens.GRCh37.87.chr.gff3.gz')

Success! Homo_sapiens.GRCh37.87.chr.gff3.gz saved to Homo_sapiens.GRCh37.87.chr.gff3.gz/Homo_sapiens.GRCh37.87.chr.gff3.gz


### Fetch the associated checksum file if present

In [11]:
try:
    fetch_ensembl('ftp.ensembl.org',
                  'anonymous', 
                  'pbousounis@childrensnational.org',
                  '/pub/grch37/current/gff3/homo_sapiens/',
                  'CHECKSUMS')
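except Exception:
    print("WARNING: checksum file could not be located/downloaded.")
    # remove the (possibly empty) output directory; ignore the error if it was never created
    shutil.rmtree('CHECKSUMS', ignore_errors=True)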
finally:
    print("Finished.")

Success! CHECKSUMS saved to CHECKSUMS/CHECKSUMS
Finished.


### Define function: ***unix_checksum()***

In [98]:
def unix_checksum(file, checksum_file):

    # calculate the GFF3 checksum
    # ('cksum -o 1' is the BSD/macOS option for the historic BSD sum algorithm)
    tmp = subprocess.check_output(["cksum", "-o", "1", file]).decode()
    file_checksum = ' '.join(tmp.split()[:2])

    # get the associated checksum from the downloaded CHECKSUMS file
    checksums = pd.read_csv(checksum_file, sep=r'\s+', header=None)
    tmp = checksums[checksums[2] == os.path.basename(file)]
    checksum = str(tmp.iloc[0,0]) + ' ' + str(tmp.iloc[0,1])

    # reference checksum from the CHECKSUMS file
    print('Reference CHECKSUM - {}'.format(checksum))
    # checksum calculated from the downloaded GFF3 file
    print('GFF3 CHECKSUM ------ {}\n'.format(file_checksum))

    if file_checksum == checksum:
        print("MATCH")
    else:
        print("ERROR: NO MATCH")
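
The ***-o 1*** option of ***cksum*** is a BSD/macOS feature and may not be available on other systems. As a portable alternative, the same historic BSD ***sum*** algorithm (a 16-bit checksum with a right rotation before each addition, plus the file size in 1024-byte blocks, rounded up) can be sketched in pure Python. ***bsd_sum()*** is a hypothetical helper and is not part of the original workflow.

```python
def bsd_sum(path, block_size=1024):
    """Return (checksum, blocks) using the historic BSD `sum` algorithm."""
    checksum = 0
    nbytes = 0
    with open(path, 'rb') as fh:
        # read the file in chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: fh.read(65536), b''):
            nbytes += len(chunk)
            for byte in chunk:
                # rotate the 16-bit checksum right by one bit, then add the byte
                checksum = (checksum >> 1) | ((checksum & 1) << 15)
                checksum = (checksum + byte) & 0xFFFF
    # file size in blocks, partial blocks rounded up
    blocks = -(-nbytes // block_size)
    return checksum, blocks
```

If this sketch agrees with ***cksum -o 1***, calling it on the downloaded GFF3 file should return the same two numbers listed for that file in CHECKSUMS. The per-byte loop is slow in pure Python, so expect it to take noticeably longer than ***cksum*** on a file of this size.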

### Compare checksums (Unix algorithm 1 via cksum)

In [99]:
file = 'Homo_sapiens.GRCh37.87.chr.gff3.gz/Homo_sapiens.GRCh37.87.chr.gff3.gz'
checksum_file = 'CHECKSUMS/CHECKSUMS'

unix_checksum(file, checksum_file)

Reference CHECKSUM - 55529 36985
GFF3 CHECKSUM ------ 55529 36985

MATCH
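
The reference lookup can also be done without pandas. The sketch below assumes the CHECKSUMS layout implied by the code above: three whitespace-separated columns (checksum, block count, filename). ***lookup_checksum()*** is a hypothetical helper. It returns integers rather than strings, so the comparison is not affected if either side is zero-padded.

```python
def lookup_checksum(checksum_file, filename):
    """Return (checksum, blocks) as integers for filename, or None if absent."""
    with open(checksum_file) as fh:
        for line in fh:
            fields = line.split()
            # assumed layout: <checksum> <blocks> <filename>
            if len(fields) == 3 and fields[2] == filename:
                return int(fields[0]), int(fields[1])
    return None
```

A return value of ***None*** makes a missing entry explicit, instead of raising an ***IndexError*** from an empty table.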


## Results:

1. ***Homo_sapiens.GRCh37.87.chr.gff3.gz*** was downloaded from ***ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/*** to ***/Users/pbousounis/Experiments/2019-10-29_hg19mod/2019-11-07_EnsemblGFF3_GRCh37-download_verify/Homo_sapiens.GRCh37.87.chr.gff3.gz/***

2. ***CHECKSUMS*** was downloaded from ***ftp.ensembl.org/pub/grch37/current/gff3/homo_sapiens/*** to ***/Users/pbousounis/Experiments/2019-10-29_hg19mod/2019-11-07_EnsemblGFF3_GRCh37-download_verify/CHECKSUMS/***

3. The Unix checksum associated with ***Homo_sapiens.GRCh37.87.chr.gff3.gz*** was extracted from the ***CHECKSUMS*** file and successfully matched the Unix checksum calculated from the downloaded file (55529 36985).