# Extracting sequences from GenBank for Probe Panel Design

#### _Non-segmented viruses_

Use the `process_gb` method to process non-segmented viruses.

This method takes five parameters:

1. `gb_file`: string; name of the GenBank file for a specific virus family or genus
2. `working_dir`: string; path of the directory where the FASTA output files will be saved
3. `virus_family`: string; name of the virus family or genus
4. `min_length`: int; the minimum sequence length for extraction (shorter sequences are excluded)
5. `host_taxafile`: string; name of a text file containing the taxonomic names of the hosts of interest, e.g. rodentia or shrews
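
To make the roles of `min_length` and `host_taxafile` more concrete, the following is a minimal sketch of the kind of filtering these two parameters imply. It is not the implementation of `process_gb`: the helper names `load_host_taxa` and `keep_record`, the one-name-per-line file layout, and the case-insensitive exact match on the host name are all assumptions made for illustration.

```python
import os
import tempfile


def load_host_taxa(path):
    """Read host taxon names, one per line (assumed layout), lowercased."""
    with open(path) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def keep_record(seq_length, host, host_taxa, min_length):
    """Return True if the sequence is long enough and its host is of interest."""
    if seq_length < min_length:
        return False
    return host.strip().lower() in host_taxa


# Hypothetical host list written to a temporary file for illustration
with tempfile.TemporaryDirectory() as tmp:
    taxa_path = os.path.join(tmp, "taxonomy_result.txt")
    with open(taxa_path, "w") as handle:
        handle.write("Rodentia\nSoricidae\n\nMus musculus\n")
    host_taxa = load_host_taxa(taxa_path)

# (sequence length, host annotation) pairs standing in for GenBank records
records = [(7200, "Mus musculus"), (4100, "Mus musculus"), (7500, "Homo sapiens")]
kept = [rec for rec in records if keep_record(rec[0], rec[1], host_taxa, 5000)]
print(kept)
```

In this sketch a record is dropped if it is shorter than the threshold or if its host is not in the list; how `process_gb` actually matches host names against the taxonomy file may differ.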

#### _Segmented viruses_

For segmented viruses, use the `process_gb_seg` method.

To run this method, you also need to define **two** variables:

1. A dictionary containing all the gene or protein names associated with each segment.
    In the example below, this dictionary is read from a text file (`hantavirus_segment_dict.txt`).
2. A dictionary containing the minimum sequence length for each segment.
    Currently this is hard-coded, e.g. `hantavirus_min_lengths`.
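
As a rough illustration of what these two dictionaries might look like, the sketch below parses a dictionary literal from text and checks it against the per-segment minimum lengths. The gene and protein names are illustrative assumptions, not the actual contents of `hantavirus_segment_dict.txt`; the variable names `segment_file_text`, `segment_names`, and `min_lengths` are likewise hypothetical.

```python
import ast

# Hypothetical contents of a segment dictionary file (illustrative names only)
segment_file_text = """{
    "S": ["nucleocapsid", "nucleoprotein", "N protein"],
    "M": ["glycoprotein", "glycoprotein precursor", "GPC"],
    "L": ["RNA-dependent RNA polymerase", "RdRp", "L protein"],
}"""

# literal_eval only accepts Python literals, so it cannot run arbitrary code
segment_names = ast.literal_eval(segment_file_text)

# Minimum sequence length for each segment
min_lengths = {"S": 845, "M": 2200, "L": 4200}

# Both dictionaries should describe the same set of segments
assert segment_names.keys() == min_lengths.keys()
print(sorted(segment_names))
```

Keeping the two dictionaries keyed by the same segment labels makes it easy to catch a missing or misspelled segment before any sequences are processed.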

The method for processing segmented viruses takes the following parameters:

1. `gb_file`: string; name of the GenBank file for a specific virus family or genus
2. `working_dir`: string; path of the directory where the FASTA output files will be saved
3. `virus_family`: string; name of the virus family or genus
4. `segment_lengths`: dict; similar to `min_length`, but specifies the minimum sequence length for each segment in the genome
5. `segment_names`: dict; the gene or protein names associated with each segment
6. `host_taxafile`: string; name of a text file containing the taxonomic names of the hosts of interest, e.g. rodentia or shrews
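
The way `segment_names` and `segment_lengths` could work together can be sketched as follows. This is only an assumed reading of the parameters, not the logic of `process_gb_seg`: the function `assign_segment`, the exact-match rule on the product name, and the example names are all hypothetical.

```python
def assign_segment(product, seq_length, segment_names, segment_lengths):
    """Return the segment label for a record, or None if it does not qualify."""
    product = product.strip().lower()
    for segment, names in segment_names.items():
        if product in (name.lower() for name in names):
            # Apply the minimum length specific to the matched segment
            if seq_length >= segment_lengths[segment]:
                return segment
            return None
    # No segment lists this gene or protein name
    return None


# Illustrative dictionaries (the names are assumptions)
segment_names = {
    "S": ["nucleocapsid", "N protein"],
    "M": ["glycoprotein precursor", "GPC"],
    "L": ["RNA-dependent RNA polymerase", "RdRp"],
}
segment_lengths = {"S": 845, "M": 2200, "L": 4200}

print(assign_segment("Nucleocapsid", 1700, segment_names, segment_lengths))
print(assign_segment("GPC", 1500, segment_names, segment_lengths))
print(assign_segment("hypothetical protein", 6500, segment_names, segment_lengths))
```

Here a record is first matched to a segment by its gene or protein name, and only then compared against that segment's own minimum length, which is why a single `min_length` value is not enough for segmented genomes.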



In [1]:
# load python script

import genbank_to_fasta


In [2]:
#### Input files ####

### genbank file
gb_file = "/Users/jayna/Downloads/sequence 2.gb"

### File path of directory
wd = "/Users/jayna/Library/CloudStorage/OneDrive-RoyalVeterinaryCollege/probe_panel_design/virus_data"

### name of virus family
virus_family = "picornaviridae3"

### minimum sequence length
min_length = 5000

### TAXONOMY LIST
host_taxa_filename = "/Users/jayna/Evolve.Zoo Dropbox/Jayna Raghwani/PycharmProjects/processGenbank/probePanel/taxonomy_result.txt"


Non-segmented viruses

In [None]:
genbank_to_fasta.process_gb(gb_file, wd, virus_family, min_length, host_taxa_filename)

Segmented viruses

In [46]:
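# Read in the text file containing the alternative names associated with each segment in the genome
import ast

with open('hantavirus_segment_dict.txt', 'r') as file:
    data = file.read()

# literal_eval parses the dictionary literal without executing arbitrary code (unlike eval)
hantavirus_segments = ast.literal_eval(data)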

# Minimum sequence lengths for each segment
hantavirus_min_lengths = {"S": 845, "M": 2200, "L": 4200}

In [47]:
# Note: gb_file should point to the GenBank file for this virus family
virus_family = "hantaviridae"
genbank_to_fasta.process_gb_seg(gb_file,
                                wd,
                                virus_family,
                                hantavirus_min_lengths,
                                hantavirus_segments,
                                host_taxa_filename)