# Finding a functional role in a sliding window across a BGC

## 1. Data files required
In this tutorial we will use the following files:  
 I. `mibig_prot_seqs_2.0.fasta`, a FASTA file of the MIBiG database.    
 II. `antibiotics-metadata.tsv`, a tab-separated file with MIBiG metadata.    
 III. `Rast-ids.tsv`, a tab-separated file with RAST Ids.  
 
 
- 1.1 Let's look at the first lines of the file `mibig_prot_seqs_2.0.fasta`

In [1]:
%%bash
head -n4 mibig_prot_seqs_2.0.fasta

>BGC0000001_AEK75490
MTELDRAFDAVPAPIYTHHERHGETVHRSAPESIRRELAALQVRAGDRVLEIGTGSGYSGALLAHLCCPDGQVTSIDISDELVRRAAAIHAERGVTSVDCHVGDGLAGYPAAAPFHRAVSWCAPPRLPRAWTQQVVNGGRIVACLPITALPSTTLIATITVAAGKPRIEALAGGGYAQSTPVAVDDALTVPGRWVDYCDRQPDPSWIGICWRSADDAQHTGARSALGQLLHPGYTDTYRQMEPHWRSWYTWTSALGDPQLSLVSLRNEIRGLGHTTPSSAAVILTDGRVIADRPDSPSLRSLRTWLQRWEHVGRPAPESFARTLVPHDCPDLAGWDLQVGHGSVTTDRQPPRRVDEPRRP	
>BGC0000001_AEK75491
MKPPASSVCPVDTSKMGNRSSPARYGRRPRKRCVELSETNLEFVHVVHRRHGHDDPRLGFFFLATAWQGQPVNREPHKCAGLVWTDPAQPPATTIAYTVAALEQIHSGRPFSLDGWAEHSPSATGCGIVDVAWEPPVRRGQREPRHQHDDGGRHGYTVGQPDPCRPLLGCRSTAHPTGCRPRRNVSSWLQLSPRTPGPYAISLMSRER	
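
The headers shown above (for example `>BGC0000001_AEK75490`) join a BGC Id and a protein accession with an underscore. As a minimal sketch, assuming every header in the file follows this pattern, the number of proteins per BGC can be counted from the headers alone. The function name `count_proteins_per_bgc` is a hypothetical helper and is not part of the tutorial scripts.

```python
from collections import Counter


def count_proteins_per_bgc(lines):
    """Count FASTA records per BGC Id.

    Assumes headers look like '>BGC0000001_AEK75490' (BGC Id, underscore, accession).
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            # Split on the first underscore only, in case the accession contains one.
            bgc_id, _, _accession = line[1:].strip().partition("_")
            counts[bgc_id] += 1
    return counts


# Usage (assumes the FASTA file is in the working directory):
# with open("mibig_prot_seqs_2.0.fasta") as handle:
#     print(count_proteins_per_bgc(handle).most_common(3))
```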


- 1.2 Let's look at the first lines of the file `antibiotics-metadata.tsv`

In [2]:
%%bash
head -n4  antibiotics-metadata.tsv  

SSV-2083	Streptomyces sviceus ATCC 29083	ABJJ02		Complete	no	yes													Lanthipeptide	BGC0000579	
himastatin	Streptomyces himastatinicus ATCC 53653	ACEX01		Unknown	yes	yes	2211362	yes					yes					cyclohexadepsipeptide	NRP	BGC0001117	
phosphinothricintripeptide	Streptomyces viridochromogenes	ACEZ01		Complete	yes	yes	1367200,11472937	yes										Aminobutyrates	NRP	BGC0000406	
hormaomycin, hormaomycin A1, hormaomycin A2, hormaomycin A3, hormaomycin A4, hormaomycin A5, hormaomycin A6	Streptomyces griseoflavus	ACFA01		Complete	yes	yes	21439483	yes										peptide lactone	Cyclic depsipeptide	BGC0000374	bacterial signaling


**This table contains the genome Id when available.** Notice that column three contains the NCBI genome Id when the species that produces this BGC has a complete genome available at NCBI.

In [3]:
%%bash
cut -f 3 antibiotics-metadata.tsv  | sort| uniq | head -n3



ABJJ02
ACEX01
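
To connect a BGC to its genome, the third column can be paired with the BGC Id found in the same row. The sketch below is one hedged way to do that. It assumes the BGC Id is whichever field matches `BGC` followed by seven digits, because the column position of the BGC Id is not stated in this tutorial. `genome_ids_by_bgc` is a hypothetical helper name.

```python
import csv
import re

# Assumption: MIBiG accessions look like 'BGC0000406' (BGC + 7 digits).
BGC_ID = re.compile(r"^BGC\d{7}$")


def genome_ids_by_bgc(lines):
    """Map BGC Id -> genome Id (third column). Rows without a genome Id are skipped."""
    mapping = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3 or not row[2].strip():
            continue  # no genome Id available for this entry
        for field in row:
            if BGC_ID.match(field.strip()):
                mapping[field.strip()] = row[2].strip()
                break
    return mapping


# Usage (assumes the metadata file is in the working directory):
# with open("antibiotics-metadata.tsv") as handle:
#     print(genome_ids_by_bgc(handle).get("BGC0000406"))
```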



- 1.3 Let's look at the first lines of the file `Rast-ids.tsv`

In [4]:
%%bash 
head -n4 Rast-ids.tsv

951719	1887.7	Streptomyces albogriseolus LBX-2 CP042594.1
951562	1956.5	Streptomyces diastaticus SID7513 JAAGMO001
951693	1960.12	Streptomyces vinaceus ATCC 27476 CP023692.1
951703	40318.14	Streptomyces nodosus ATCC 14899 CP023747.1


**RAST Job Id.** This tab-separated file contains three columns: the first is the RAST Job Id, the second is the RAST genome Id, and the third contains the organism's name.
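
In the sample rows above, the organism column ends with a genome accession (for example `JAAGMO001`). Assuming that holds for every row, a genome Id can be matched to its RAST Job Id with a simple substring search. This is only a sketch: `find_rast_job` is a hypothetical helper, and the tutorial scripts may perform this lookup differently.

```python
def find_rast_job(lines, genome_id):
    """Return (job_id, rast_genome_id) for the first row mentioning genome_id, else None.

    Assumes tab-separated rows: RAST Job Id, RAST genome Id, organism name,
    with the genome accession appearing inside the organism text.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        job_id, rast_genome_id = fields[0], fields[1]
        organism = " ".join(fields[2:])
        if genome_id in organism:
            return job_id, rast_genome_id
    return None


# Usage (assumes the file is in the working directory):
# with open("Rast-ids.tsv") as handle:
#     print(find_rast_job(handle, "JAAGMO001"))
```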

## 2. Blast databases

## 3. Obtaining BGC borders
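First we will run `getSeq.sh` to obtain the sequences at the borders of each BGC.  
`bash src/getSeq.sh <BGC Id> <MIBiG fasta file> <metadata file> <RAST Ids file> <window size> <number of windows> <functional word>`  

The arguments, in order, are: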

- 1 BGC Id
- 2 MIBiG fasta db (default `mibig_prot_seqs_2.0.fasta`)
- 3 MIBiG metadata, including the genome Id in the third column if it exists (`antibiotics-metadata.tsv`)
- 4 File with RAST Ids (`Rast-ids.tsv`)
- 5 Sliding window size (10 in this tutorial)
- 6 Number of windows (5 in this tutorial)
- 7 Functional word to search for in the genome functional annotation file (`ansporter`, which stands for [Tt]ransporters)

This script needs as input a BGC Id from MIBiG, for example `BGC0000406`, and the MIBiG fasta file.      
- `getSeq.sh` obtains the fasta files `BGC0000406.i` and `BGC0000406.f`, which stand for the initial and final sequences of the BGC.  
- It identifies the genome that contains this BGC using the metadata file.  
- It runs a Blast search and finds the best hit for each of the border sequences. 
- If the hits correspond to the BGC, it calls the Perl script `getTrans.pl` to search the annotation file for genes that correspond to the functional category in a sliding window, for example:  
  `src/getTrans.pl 6772 6744 220639 BGC0000406 10 5 ansporter`
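
The first step above, extracting the border sequences, can be illustrated in Python. This is a sketch under the assumption that the "initial" and "final" sequences are simply the first and last records of the BGC in file order. The actual logic of `getSeq.sh` is not shown in this tutorial, and `border_records` is a hypothetical name.

```python
def border_records(lines, bgc_id):
    """Return ((header, seq), (header, seq)) for the first and last record of bgc_id.

    Returns None when the BGC Id is not present. Assumes headers of the
    form '>BGCID_ACCESSION' and records listed in cluster order.
    """
    first = last = None
    header, seq = None, []

    def flush():
        # Store the record that was just read if it belongs to the requested BGC.
        nonlocal first, last
        if header is not None and header.startswith(bgc_id + "_"):
            record = (header, "".join(seq))
            if first is None:
                first = record
            last = record

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            flush()
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    flush()  # do not forget the final record in the file
    return None if first is None else (first, last)


# Usage (assumes the FASTA file is in the working directory):
# with open("mibig_prot_seqs_2.0.fasta") as handle:
#     initial, final = border_records(handle, "BGC0000406")
```

When a BGC has a single protein, both border records are the same record.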




3.1 First, let's check whether any files with the extensions `.i` or `.f` already exist. In the output below, `BGC0000406.f` and `BGC0000406.i` are already present.

In [5]:
ls *.i *.f

BGC0000406.f  BGC0000406.i


3.2 Now let's run `getSeq.sh`

In [6]:
%%bash
src/getSeq.sh BGC0000406 mibig_prot_seqs_2.0.fasta antibiotics-metadata.tsv Rast-ids.tsv 10 5 ansporter
    

BGC BGC0000406 genome ACEZ01 file 220639
src/getTrans.pl 6772 6744 220639 BGC0000406 10 5 ansporter
BGC0000406	0	1	1	1	4	6


3.3 Find the genome file that corresponds to this BGC
BGC BGC0000431 genome ABYB01 file 512417

3.4 Find the corresponding border gene in that genome
getTrans.pl 635 656 512417 BGC0000431


3.5 Check whether there are transporters in each window
BGC0000406	0	0	0	0	0
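
The window search itself can be sketched in Python. The internals of `getTrans.pl` are not shown in this tutorial, so the following is only one possible reading: it assumes the two border genes are given as positions in an ordered list of functional annotations, and that each successive window widens the region by `window_size` genes on both sides. The function name and its parameters are hypothetical.

```python
def count_in_expanding_windows(annotations, border_a, border_b,
                               window_size, n_windows, word):
    """Count annotations containing `word` in the cluster and in wider flanks.

    annotations: functional annotations in genome order (index = gene position).
    Returns n_windows + 1 counts: index 0 covers the cluster itself, and
    index k adds k * window_size genes on each side (an assumption).
    """
    # Border hits may come in either order, e.g. on the reverse strand.
    lo, hi = sorted((border_a, border_b))
    counts = []
    for k in range(n_windows + 1):
        start = max(0, lo - k * window_size)
        stop = min(len(annotations), hi + k * window_size + 1)
        # Substring match: 'ansporter' hits both 'Transporter' and 'transporter'.
        counts.append(sum(word in ann for ann in annotations[start:stop]))
    return counts


# Usage with a toy annotation list (not real data):
# genes = ["hypothetical protein"] * 30
# genes[12] = "MFS Transporter"
# print(count_in_expanding_windows(genes, 14, 12, 5, 2, "ansporter"))
```

Searching for the fragment `ansporter` rather than a full word avoids having to handle the capitalization of the first letter.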