# VIRAL GENOME EXTRACTION USING URLLIB (PYTHON 3)
This program reads a text file of viral data and downloads the genome of each virus listed in it. 
I found part of this pipeline on this [blog](http://www.cureffi.org/2013/01/25/aligning-unmapped-reads-to-viral-genomes/), but I modified it for our group's purposes. We want to grab all the viral genomes and use their k-mers in a machine learning approach, either to test a piece of software called VirFinder or to create our own version.

### First we need to import modules (we will only use 2)

In [2]:
## urllib.request raises urllib.error.HTTPError, so that is the class to catch
from urllib.error import HTTPError
import urllib.request

### Now we create lists to store all the info that we want
I have split the lists into 4 groups. The first group gathers ALL the info from the text file.
'virus' contains all the viruses from the text file. 
'bp_ID_accnum' contains all the bioproject IDs. 
'vt' contains all the virus types from the text file (i.e. ss RNA, ds DNA, RETRO). 
'v_size' contains all the virus sizes in kb. 

The next 2 groups hold the same kind of information. The 'yes' lists hold the entries that passed both the if statement and the try and except statement below. The 'no' lists hold the entries that were rejected by the if statement. 

Lastly, the 'csv' list holds the names of the FASTA files written for the viruses that passed the if statement and the try and except statement.

In [3]:
virus = []
bp_ID_accnum = []
vt = []
v_size = []

yes_genome = []
yes_v_size = []
yes_vt = []

no_genome = []
no_v_size = []
no_vt = []

csv = []

#### This is just a quick file opener. 
Inside this file opener, we read in the lines (skipping the first one), split each line on the tab, and append the desired fields to our lists above.

In [4]:
with open('small_virus.txt', 'r') as in_file:
    for line in in_file.readlines()[1:]:
        virus.append(line.split('\t')[0])
        bp_ID_accnum.append(line.split('\t')[3])
        vt.append(line.split('\t')[4])
        v_size.append(line.split('\t')[6])
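
As a hedged alternative to the cell above, the sketch below splits each line only once and skips lines that are too short. The function name `parse_virus_table` and the sample rows are made up for illustration; only the column positions (0, 3, 4, 6) come from the cell above.

```python
import io

def parse_virus_table(handle):
    """Return four parallel lists from a tab-separated table with one header row."""
    names, bioprojects, types, sizes = [], [], [], []
    next(handle, None)  # skip the first (header) line
    for line in handle:
        fields = line.rstrip("\n").split("\t")  # split once per line
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        names.append(fields[0])
        bioprojects.append(fields[3])
        types.append(fields[4])
        sizes.append(fields[6])
    return names, bioprojects, types, sizes

# Made-up two-row table standing in for small_virus.txt
sample = io.StringIO(
    "name\tc1\tc2\tbioproject\ttype\tc5\tsize\n"
    "Example virus A\tx\tx\t15425\tssRNA\tx\t9.7\n"
    "Example virus B\tx\tx\t0\tdsDNA\tx\t152.3\n"
)
names, bioprojects, types, sizes = parse_virus_table(sample)
```

With a real file, the same function could be called as `parse_virus_table(open('small_virus.txt'))` inside a `with` block.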

In [6]:
for i in bp_ID_accnum[0:1]:
    print(i)

15425


###  This for loop looks a little busy, but it can be broken down pretty simply
First, we enumerate the accession numbers in the list and, for each one, check whether its integer value is greater than 0. If it is, we enter the 'try' block. We need this check because an accession number that is not greater than zero makes NCBI throw an error, which messes up the downstream analysis. So if the accession number is NOT greater than zero, we go to the 'else' section and append the entry to the 'no' lists. It is important to note that I am using a fancy little enumerate trick: the enumerated number indexes the other lists, so I can pull the information at that position and append it to another list. I have found this trick to be extremely useful. 
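
Here is a toy version of the enumerate trick on its own, with made-up IDs and names, to show how the index from one list pulls the matching entry from another. The list names `ids`, `names`, `kept`, and `skipped` are placeholders, not the notebook's lists.

```python
ids = ["15425", "0", "14331"]            # made-up BioProject IDs
names = ["virus A", "virus B", "virus C"]  # parallel list, same order as ids

kept, skipped = [], []
for i, entry in enumerate(ids):
    # i is the position, entry is the ID at that position
    if int(entry) > 0:
        kept.append(names[i])     # same position in the parallel list
    else:
        skipped.append(names[i])
```

This only works while every list has the same length and order, which is why all four lists are filled in the same loop when the file is read.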

So let's talk about the 'try.' First, we take the '\n' off the entry with .strip(). We then build the URL with the bioproject ID inserted into it. We use urllib.request to open the URL and keep the connection in conn. NCBI redirects the request, and we set nuccoreId to the last element of the redirected URL after it is split on the '/'. If you are curious, before we split the URL it looks like this [https://www.ncbi.nlm.nih.gov/nuccore/38707888](https://www.ncbi.nlm.nih.gov/nuccore/38707888), so the nuccoreId is '38707888'. The next variable is nuccoreUrl: we insert nuccoreId into this URL and connect to it. This part is cool because the .read() method saves everything this URL returns. We save all the info (which should be the whole genome that corresponds to that nuccoreId) and write a little FASTA file whose name is stored in the 'outpath' variable. We also append our info to the 'yes' lists. 
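
The string handling in these steps can be tried without touching the network. The sketch below is only an illustration: the helper names are invented here, while the URL patterns are copied from the loop in this notebook.

```python
def bioproject_search_url(bioproject_id):
    """Build the search URL that NCBI redirects to a nuccore page."""
    clean_id = bioproject_id.strip()  # drop a trailing newline, if any
    return ("http://www.ncbi.nlm.nih.gov/sites/nuccore?term="
            + clean_id + "[BioProject]")

def nuccore_id_from_url(redirected_url):
    """Take the last path element of the redirected URL."""
    return redirected_url.split("/")[-1]

def efetch_fasta_url(nuccore_id):
    """Build the efetch URL that returns the FASTA record."""
    return ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
            "?db=nuccore&id=" + nuccore_id + "&rettype=fasta")

search_url = bioproject_search_url("15425\n")
nuccore_id = nuccore_id_from_url("https://www.ncbi.nlm.nih.gov/nuccore/38707888")
fasta_url = efetch_fasta_url(nuccore_id)
outpath = nuccore_id + ".fa"
```

Keeping the URL building in small functions like this makes it easy to check each piece before running the full download loop.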

The 'except' portion of this code is just so that if there is an HTTPError, we will be notified and the program will move on. 
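
For reference, a minimal sketch of this kind of error handling is shown below. Errors raised by urllib.request.urlopen live in urllib.error, and HTTPError is a subclass of URLError, so it has to be caught first. The function `fetch_bytes` and its `opener` parameter are hypothetical; the parameter is there only so the function can be tried without a network connection.

```python
import io
import urllib.request
from urllib.error import HTTPError, URLError

def fetch_bytes(url, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with opener(url) as conn:  # the connection closes automatically
            return conn.read()
    except HTTPError as err:       # server answered with an error status
        print("HTTP error {} for {}".format(err.code, url))
    except URLError as err:        # server could not be reached at all
        print("Could not reach {}: {}".format(url, err.reason))
    return None

# Stand-in openers so the sketch runs offline
def failing_opener(url):
    raise HTTPError(url, 404, "Not Found", {}, io.BytesIO())

def working_opener(url):
    return io.BytesIO(b">seq\nACGT\n")

missing = fetch_bytes("http://example.invalid/a", opener=failing_opener)
found = fetch_bytes("http://example.invalid/b", opener=working_opener)
```

Returning None on failure lets the calling loop decide what to do with the entry, for example appending it to a list of viruses that could not be downloaded.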

Lastly, in the 'else' branch, we append the info for the entries that never made it into the 'try' block to the 'no' lists. We will do more with this data later. 

In [1]:
## i is the index from enumerate, j is the actual entry from the list
## we will use i to index the other lists
for i, j in enumerate(bp_ID_accnum):
    ## show the entry being processed (printing the whole list on every pass floods the output)
    print(j)
    ## this step is needed to catch accession numbers that throw an error downstream ##
    if int(j) > 0:
        try:
            ## remove newline
            bioprojectId = j.strip()
            #print(virus[i])
            ## create URL to get redirected
            bioprojectUrl = "http://www.ncbi.nlm.nih.gov/sites/nuccore?term="+bioprojectId+"[BioProject]"
            #print(bioprojectUrl)
            ## open URL
            conn = urllib.request.urlopen(bioprojectUrl) 
            ## retrieve nuccore ID from redirected URL
            nuccoreId = conn.url.split("/")[-1]
            conn.close() 
            ## create URL for FASTA file
            nuccoreUrl = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id="+nuccoreId+"&rettype=fasta" 
            ## open URL
            conn = urllib.request.urlopen(nuccoreUrl) 
            ## get FASTA text
            fasta = conn.read() 
            ## close URL
            conn.close() 
            ## create file path to write FASTA
            outpath = nuccoreId + ".fa"
            
            
            csv.append(outpath)
            yes_genome.append(virus[i])
            yes_v_size.append(v_size[i])
            yes_vt.append(vt[i])
            
            
            # auto-close file after done
            with open(outpath,"wb") as o: 
                o.write(fasta)
        except HTTPError as e:
            print("Unable to retrieve genome for bioproject " + bioprojectId)
    else:
        ## lets turn this into a csv just like the 'found' genomes csv
        no_genome.append(virus[i])
        no_v_size.append(v_size[i])
        no_vt.append(vt[i])
            

NameError: name 'bp_ID_accnum' is not defined

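#### We are just closing the file
The 'with' block above already closes the file when it ends, so this call is redundant but harmless.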

In [None]:
in_file.close()

#### This little block writes all the viruses that were collected, with their name, genome size, type, and the file containing their genome
I thought it would be useful to write the virus name, genome size, virus type, and the corresponding FASTA file for each viral genome to a master csv file. 

The zip function is quite nice: it pairs up the lists element by element and yields one tuple per virus. I have found this function to be extremely useful as well.
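
A small illustration of what zip produces, using made-up values rather than the notebook's lists. It also shows one thing to watch for in Python 3: zip returns an iterator, so it can only be looped over once.

```python
names = ["virus A", "virus B"]   # made-up values
sizes = ["9.7", "152.3"]
files = ["111.fa", "222.fa"]

# Wrapping zip in list() gives one tuple per position
rows = list(zip(names, sizes, files))

# A bare zip object is used up after one pass
pairs = zip(names, sizes)
first_pass = list(pairs)
second_pass = list(pairs)  # empty, because pairs was already consumed
```

If the zipped rows are needed more than once, storing them with `list(zip(...))` avoids the empty second pass.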

In [None]:
lines = zip(yes_genome,yes_v_size,yes_vt,csv)

In [None]:
with open('virus_files.csv', 'w') as out:
    for i,j in enumerate(lines):   
        out.write("{},{},{},{},{}\n".format( i+1, j[0], j[1], j[2], j[3]))
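
One possible refinement, sketched under the assumption that some virus names might contain commas: the standard-library csv module quotes such fields automatically. Because this notebook already uses `csv` as a list name, the module is imported under the made-up alias `csv_module`. The row is invented, and an in-memory buffer stands in for the output file.

```python
import csv as csv_module  # aliased because 'csv' is a list name in this notebook
import io

rows = [("Example virus, strain 1", "9.7", "ssRNA", "111.fa")]  # made-up row

buffer = io.StringIO()  # stands in for open('virus_files.csv', 'w', newline='')
writer = csv_module.writer(buffer, lineterminator="\n")
for number, row in enumerate(rows, start=1):
    writer.writerow([number, *row])  # running number, then the four fields

output = buffer.getvalue()
```

The name with a comma ends up wrapped in double quotes, so a csv reader still sees five columns.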

### Now we write a file for the viruses we did NOT collect, i.e. the ones sent to the 'no' lists above.

In [None]:
lines = zip(no_genome,no_v_size,no_vt)

In [12]:
## open the file once; opening it in 'w' mode inside the loop would overwrite it on every pass
with open('unblasted.txt', 'w') as out:
    for i, j in enumerate(lines):
        out.write("{},{},{},{}\n".format( i+1, j[0], j[1], j[2]))