# BLAST and GOSlim annotation of *Acropora palmata* transcriptome

This workflow details the annotation of an *Acropora palmata* [transcriptome](https://usegalaxy.org/datasets/cb51c4a06d7ae94e/display?to_ext=fasta).

The workflow assumes that you already have a BLAST database in place for blastx searches, and that you use SQLShare to create the final annotation. Information on setting up a BLAST database can be found [here](http://nbviewer.ipython.org/github/sr320/austral/blob/master/modules/01-Piura-Annotation/01-Local_BLAST.ipynb). Information on SQLShare can be found [here](https://sqlshare.escience.washington.edu/).

In [None]:
#Create directories (-p also creates parent directories and does not fail if they already exist)
!mkdir -p ./Data/Apalm
!mkdir -p ./Analyses/Apalm

In [None]:
cd ./Data/Apalm

In [2]:
#Obtain FASTA file. The URL is quoted because it contains "?", and -O saves
#the download under the file name used in the cells below.
!wget -O "Galaxy5-[Apalmata_assembled.fasta].fasta" "https://usegalaxy.org/datasets/cb51c4a06d7ae94e/display?to_ext=fasta"

In [None]:
!head "Galaxy5-[Apalmata_assembled.fasta].fasta"

In [None]:
!tail "Galaxy5-[Apalmata_assembled.fasta].fasta"

In [None]:
#Some header lines (>) in the FASTA are preceded by double quotes ("); remove every " from the file
!sed 's/"//g' "Galaxy5-[Apalmata_assembled.fasta].fasta" > Apalm.fasta

In [None]:
!head Apalm.fasta

In [None]:
#Count the number of sequences (one header line starting with > per sequence)
!grep -c "^>" Apalm.fasta
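
The quote removal and sequence count above can also be sketched in Python, which may be handy if `sed` or `grep` is not available. This is only an illustrative sketch: the function name `clean_and_count` and the two-record example are hypothetical and are not taken from the transcriptome.

```python
def clean_and_count(lines):
    """Strip double quotes from each line and count FASTA header lines."""
    cleaned = []
    n_seqs = 0
    for line in lines:
        line = line.replace('"', '')   # same effect as sed 's/"//g'
        if line.startswith('>'):       # a header line marks one sequence
            n_seqs += 1
        cleaned.append(line)
    return cleaned, n_seqs

# Hypothetical two-record example (not real data)
example = ['">contigA\n', 'ATGC\n', '>contigB\n', 'GGCC\n']
cleaned, n_seqs = clean_and_count(example)
print(n_seqs)
```

To apply it to the real file, one could pass an open file handle, for example `clean_and_count(open("Galaxy5-[Apalmata_assembled.fasta].fasta"))`.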

### Blastx query

In [None]:
#Run blastx against the uniprot_sprot database (replace the -db path with your own).
#Options: report at most one target sequence and one high-scoring pair per query,
#use an E-value cutoff of 1E-05, write tabular output (-outfmt 6), and use 8 threads.
#Results go to the Analyses directory; standard error is written to its own file.
#Paths are relative to ./Data/Apalm, the current working directory.
!blastx \
-query Apalm.fasta \
-db ~blast/db/uniprot_sprot \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt 6 \
-evalue 1E-05 \
-num_threads 8 \
-out ../../Analyses/Apalm/Apalm_blastx_uniprot.tab \
2> ../../Analyses/Apalm/Apalm_blastx_uniprot.error
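
For readers unfamiliar with `-outfmt 6`, the sketch below parses one line of tabular output into named fields, using the default BLAST+ column order. The names `BlastHit`, `parse_hit`, and `passes_evalue` are illustrative, and the example hit (including the accession `P00000`) is made up.

```python
from collections import namedtuple

# Default column order of BLAST+ tabular output (-outfmt 6)
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
BlastHit = namedtuple("BlastHit", FIELDS)

def parse_hit(line):
    """Split one tab-delimited BLAST line into a BlastHit of strings."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(parts)))
    return BlastHit(*parts)

def passes_evalue(hit, cutoff=1e-5):
    """True if the hit's E-value is at or below the cutoff."""
    return float(hit.evalue) <= cutoff

# Hypothetical hit line (not real output from this workflow)
example = "contigX\tsp|P00000|EXMP_HUMAN\t50.0\t100\t50\t0\t1\t300\t1\t100\t1e-20\t95.0\n"
hit = parse_hit(example)
print(hit.qseqid, hit.sseqid, hit.evalue)
```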

In [None]:
#Return to the project root so that the ./Analyses paths below resolve
%cd ../..

In [None]:
#Check the head of the output file
!head -10 ./Analyses/Apalm/Apalm_blastx_uniprot.tab

In [None]:
#Comparison of the tail with original FASTA should give an idea of whether
#the blast job is complete (note contig25409_16070 present in both)
!tail -10 ./Analyses/Apalm/Apalm_blastx_uniprot.tab
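
The tail comparison can be turned into a rough progress estimate. The sketch below reports how far into the FASTA the last query with a hit sits. It is only an approximation, since contigs without a hit at the chosen E-value never appear in the BLAST output, so the last hit need not be the last contig. The function name `blast_progress` and the IDs are hypothetical.

```python
def blast_progress(fasta_ids, last_query_id):
    """Return (position, total): the 1-based position of the last BLAST
    query ID within the ordered list of FASTA sequence IDs, and the
    total number of sequences."""
    position = fasta_ids.index(last_query_id) + 1
    return position, len(fasta_ids)

# Hypothetical IDs (not real contigs from the transcriptome)
fasta_ids = ["c1", "c2", "c3", "c4"]
position, total = blast_progress(fasta_ids, "c3")
print("last hit is sequence %d of %d" % (position, total))
```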

In [None]:
!wc ./Analyses/Apalm/Apalm_blastx_uniprot.tab

In [None]:
#Convert pipes to tabs to produce a fully tab-delimited file for SQLShare
!tr '|' "\t" < ./Analyses/Apalm/Apalm_blastx_uniprot.tab \
> ./Analyses/Apalm/Apalm_blastx_uniprot_sql.tab
!head -1 ./Analyses/Apalm/Apalm_blastx_uniprot.tab
!echo SQLShare ready version has Pipes converted to Tabs ....
!head -1 ./Analyses/Apalm/Apalm_blastx_uniprot_sql.tab
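
The pipe conversion matters because it changes the column numbering that the SQL query relies on. Assuming the subject IDs follow the usual UniProt `sp|accession|entry_name` form, the single subject column splits into three, so the accession lands in the third column (`Column3` in SQLShare). The sketch below illustrates this with a made-up hit line; `pipes_to_tabs` is an illustrative name.

```python
def pipes_to_tabs(line):
    """Replace every pipe with a tab, as tr '|' "\t" does."""
    return line.replace("|", "\t")

# Hypothetical hit line with 12 tab-delimited columns (not real output)
example = "contigX\tsp|P00000|EXMP_HUMAN\t50.0\t100\t50\t0\t1\t300\t1\t100\t1e-20\t95.0\n"
cols = pipes_to_tabs(example).rstrip("\n").split("\t")

print(len(cols))   # two pipes add two columns
print(cols[0])     # Column1: contig ID
print(cols[2])     # Column3: UniProt accession used in the join
```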

### Manually uploading Apalm_blastx_uniprot_sql.tab to SQLShare and joining with GOSlim

`SELECT DISTINCT Column1 AS ContigID, GOSlim_bin FROM
  [jldimond@washington.edu].[Apalm_blastx_uniprot_sql.tab] anno
  LEFT JOIN [sr320@washington.edu].[SPID and GO Numbers] go
  ON anno.Column3 = go.SPID
  LEFT JOIN [sr320@washington.edu].[GO_to_GOslim] slim
  ON go.GOID = slim.GO_id WHERE aspect LIKE 'P'`
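
To see what this join does without SQLShare, the sketch below runs an equivalent query on an in-memory SQLite database. Everything in it is an assumption made for illustration: the table names (`anno`, `spid_go`, `slim`), the placement of the `aspect` column in the GOSlim table, and all rows, which are toy data with made-up accessions and GO IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-ins for the three SQLShare tables (hypothetical schemas)
cur.execute("CREATE TABLE anno (Column1 TEXT, Column3 TEXT)")
cur.execute("CREATE TABLE spid_go (SPID TEXT, GOID TEXT)")
cur.execute("CREATE TABLE slim (GO_id TEXT, GOSlim_bin TEXT, aspect TEXT)")

cur.executemany("INSERT INTO anno VALUES (?, ?)",
                [("contigX", "P00000"), ("contigY", "P11111")])
cur.executemany("INSERT INTO spid_go VALUES (?, ?)",
                [("P00000", "GO:X1"), ("P00000", "GO:X2")])
cur.executemany("INSERT INTO slim VALUES (?, ?, ?)",
                [("GO:X1", "protein metabolism", "P"),
                 ("GO:X2", "toy component bin", "C")])

# Same shape as the SQLShare query: contig -> accession -> GO -> GOSlim
rows = cur.execute("""
    SELECT DISTINCT anno.Column1 AS ContigID, slim.GOSlim_bin
    FROM anno
    LEFT JOIN spid_go ON anno.Column3 = spid_go.SPID
    LEFT JOIN slim ON spid_go.GOID = slim.GO_id
    WHERE slim.aspect LIKE 'P'
    ORDER BY ContigID
""").fetchall()
conn.close()

print(rows)
```

Note that filtering on `aspect` in the `WHERE` clause drops contigs with no biological process term (such as `contigY` here), so the left joins behave like inner joins in the final result.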

#### Output file downloaded to ./Analyses/Apalm: Apalm_GOSlim.csv

In [17]:
!head -10 ./Analyses/Apalm/Apalm_GOSlim.csv

ContigID,GOSlim_bin
contig135011_153678_153601,cell organization and biogenesis
contig135011_153678_153601,other biological processes
contig135011_153678_153601,developmental processes
contig69684,protein metabolism
contig113621,protein metabolism
contig97647,protein metabolism
contig199902,protein metabolism
contig78855,other biological processes
contig8505_94477,DNA metabolism


In [18]:
#Convert from comma-delimited to tab-delimited
!tr ',' "\t" < ./Analyses/Apalm/Apalm_GOSlim.csv > ./Analyses/Apalm/Apalm_GOSlim.tab

In [19]:
!head -10 ./Analyses/Apalm/Apalm_GOSlim.tab

ContigID	GOSlim_bin
contig135011_153678_153601	cell organization and biogenesis
contig135011_153678_153601	other biological processes
contig135011_153678_153601	developmental processes
contig69684	protein metabolism
contig113621	protein metabolism
contig97647	protein metabolism
contig199902	protein metabolism
contig78855	other biological processes
contig8505_94477	DNA metabolism
