## Sequence Annotation - Pacific cod to Atlantic cod Genome Alignment

I aligned 4,286 loci from my final Stacks run (batch 8) to the Atlantic cod genome. I want to retrieve annotations for any protein-coding regions that these loci aligned to or near, so that if the sliding window analysis turns up regions of interest, I can look up the loci within each window.

<br>
Steps for this analysis:

1. Convert .sam file to .bam file to .bed file
2. Sort .bed file
3. Download and sort the annotation file (previously completed in [this notebook](https://github.com/mfisher5/PCod-Korea-repo/blob/master/notebooks/Batch%208%20-%20Outlier%20Alignment%20verif.ipynb))
4. Run closestBed

<br>
### Convert .sam file to .bed file

In [1]:
cd ../analyses/alignment/

/mnt/hgfs/PCod-Compare-repo/analyses/alignment


*to run the following code, your .sam file must have a header*
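
A quick way to confirm the header requirement before converting is to check whether the first line of the .sam file starts with `@`. The sketch below is plain shell (run it in a terminal or a `%%bash` cell rather than with `!`). The toy files and the `has_sam_header` helper are hypothetical, made up for illustration only.

```shell
# Toy SAM files (made-up records) so the check can be run anywhere.
workdir=$(mktemp -d)
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:LG01\tLN:1000\nlocus1\t0\tLG01\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' > "$workdir/with_header.sam"
printf 'locus1\t0\tLG01\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' > "$workdir/no_header.sam"

# Hypothetical helper: succeeds only if the first line is a header line (@HD, @SQ, ...).
has_sam_header() {
    head -n 1 "$1" | grep -q '^@'
}

for f in "$workdir/with_header.sam" "$workdir/no_header.sam"; do
    if has_sam_header "$f"; then
        echo "$(basename "$f"): header found"
    else
        echo "$(basename "$f"): no header"
    fi
done
```

To check the real file, call the helper on `batch_8_final_filtered_gadMor2LG_filteredMQ.sam` instead of the toy files.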

In [5]:
!samtools view -S -b batch_8_final_filtered_gadMor2LG_filteredMQ.sam > batch_8_final_filtered_gadMor2LG_filteredMQ.bam

In [6]:
!samtools sort batch_8_final_filtered_gadMor2LG_filteredMQ.bam \
-o batch_8_final_filtered_gadMor2LG_filteredMQ_sorted.bam

*navigate into the bedtools 'bin' directory to run the bamToBed code*

In [7]:
cd bedtools2/

/mnt/hgfs/PCod-Compare-repo/analyses/alignment/bedtools2


In [9]:
cd bin

/mnt/hgfs/PCod-Compare-repo/analyses/alignment/bedtools2/bin


In [10]:
!./bamToBed -i ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted.bam \
> ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted.bed

<br>
### Sort the .bed file

In [11]:
!./sortBed -i ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted.bed \
> ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted2.bed
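
For reference, the ordering that `closestBed` expects (chromosome, then start position) can also be produced and checked with plain `sort`. This is a minimal sketch on made-up BED6 records, not the command I ran; it only illustrates what "sorted" means here.

```shell
workdir=$(mktemp -d)

# Made-up BED6 records, deliberately out of order.
printf 'LG02\t500\t642\tlocusC\t60\t+\nLG01\t900\t1042\tlocusB\t60\t-\nLG01\t100\t242\tlocusA\t60\t+\n' > "$workdir/toy.bed"

# Sort by chromosome name, then numerically by start position.
LC_ALL=C sort -k1,1 -k2,2n "$workdir/toy.bed" > "$workdir/toy_sorted.bed"

# sort -c exits non-zero if the file is not already in the requested order.
LC_ALL=C sort -c -k1,1 -k2,2n "$workdir/toy_sorted.bed" && echo "sorted"

# Show chromosome, start, and locus name of the sorted records.
cut -f1,2,4 "$workdir/toy_sorted.bed"
```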

<br>
### Download & Sort Annotation file (.gff)

I completed this step previously; see [PCod-Korea Outlier Alignment notebook](https://github.com/mfisher5/PCod-Korea-repo/blob/master/notebooks/Batch%208%20-%20Outlier%20Alignment%20verif.ipynb)

<br>
<br>

### closestBed

DO NOT use bedtools v2.25 to run the `closestBed` command.

`closest -a file.bed -b .gff -g table.tab -D a > outfile.bed`

- Argument `-D a`: reports the closest feature in -b (ACod) with its distance from -a as an extra column. Negative distances indicate upstream features. Distances are reported with respect to -a: per the bedtools documentation, when the -a feature (the PCod sequence) is on the - strand, "upstream" means that the ACod gene has a higher (start, stop) than the PCod sequence.

- Argument `-k`: reports the "k" closest hits. Default is 1.
<br>

This must be done from within the bedtools2/bin folder.

In [12]:
!./closestBed -a ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted2.bed \
-b /mnt/hgfs/PCod-Compare-repo/ACod_reference/gadMor2_annotation_complete_genes_manualsort.gff \
-D a \
-k 2 \
-header \
> ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted2_Annotations.bed
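
Because `-D` appends the distance as the last column, the output can be filtered with `awk` to keep only hits within some distance of a locus. The sketch below runs on made-up rows and assumes a BED6 + 9-column GFF layout (locus name in column 4, GFF attributes in column 15); the 10 kb cutoff is a hypothetical value, not one used in this analysis.

```shell
workdir=$(mktemp -d)
max_dist=10000   # hypothetical cutoff in bp; not a value from this analysis

# Made-up closestBed-style rows: BED6 (cols 1-6) + GFF (cols 7-15) + distance (last col).
printf 'LG02\t100\t242\tlocusA\t60\t+\tLG02\tmaker\tgene\t5000\t9000\t.\t+\t.\tID=geneX\t4759\n' > "$workdir/toy_annotations.bed"
printf 'LG02\t100\t242\tlocusA\t60\t+\tLG02\tmaker\tgene\t50\t90\t.\t-\t.\tID=geneY\t-11\n' >> "$workdir/toy_annotations.bed"
printf 'LG06\t70000\t70142\tlocusB\t60\t-\tLG06\tmaker\tgene\t10\t900\t.\t+\t.\tID=geneZ\t69101\n' >> "$workdir/toy_annotations.bed"

# Keep rows whose absolute distance is within the cutoff;
# print locus name, GFF attributes, and the signed distance.
awk -F'\t' -v OFS='\t' -v max="$max_dist" '
    { d = $NF + 0; if (d < 0) d = -d }
    d <= max { print $4, $15, $NF }
' "$workdir/toy_annotations.bed" > "$workdir/toy_near.tsv"

cat "$workdir/toy_near.tsv"
```

Using `$NF` for the distance avoids hard-coding its column number, which changes if the -a or -b file has a different number of columns.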

<br>
_______________________________

#### 3/26/2018

### Remaining annotation

Fun fact about the `gadMor2_annotation_complete.gff` file - it only contains annotation information for the following chromosomes:
- LG02
- LG06
- LG08
- LG09
- LG11
- LG14
- LG15
- LG17
- LG23

There is an additional file, `gadMor2_annotation_filtered.gff`, which contains annotation information for the remaining chromosomes. 

To avoid duplicating annotation information, I used `grep` to isolate the remaining chromosomes from the `filtered.gff` file into two files: 

`gadMor2_annotation_remainingLG.gff` : remaining linkage groups from LG01 - LG12
<br>
`gadMor2_annotation_remainingLGp2.gff`: remaining linkage groups from LG13 - LG22
<br>

In [2]:
cd ../ACod_reference/

/mnt/hgfs/PCod-Compare-repo/ACod_reference


In [None]:
!grep "LG01" gadMor2_annotation_filtered.gff >> gadMor2_annotation_remainingLG.gff
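
Only the LG01 command is shown above. As an alternative sketch, several linkage groups can be pulled in one pass by matching the first GFF column exactly with `awk`, writing with `>` so that re-running the cell does not append duplicate lines. The toy records and the `wanted` list are made up for illustration; edit the list to match your own split.

```shell
workdir=$(mktemp -d)

# Made-up GFF records covering four linkage groups.
printf 'LG01\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\nLG02\tmaker\tgene\t100\t900\t.\t+\t.\tID=g2\nLG03\tmaker\tgene\t100\t900\t.\t-\t.\tID=g3\nLG13\tmaker\tgene\t100\t900\t.\t+\t.\tID=g4\n' > "$workdir/toy_filtered.gff"

# Linkage groups to keep (illustrative list only).
wanted="LG01 LG03"

# Build a lookup of wanted names, then print rows whose column 1 is in it.
awk -F'\t' -v wanted="$wanted" '
    BEGIN { n = split(wanted, ids, " "); for (i = 1; i <= n; i++) keep[ids[i]] = 1 }
    $1 in keep
' "$workdir/toy_filtered.gff" > "$workdir/toy_remainingLG.gff"

cut -f1,9 "$workdir/toy_remainingLG.gff"
```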

These files are still too large to open in Excel to sort for `closestBed`, so I again kept only the lines that contain "gene".

In [3]:
!grep "gene" gadMor2_annotation_remainingLG.gff >> gadMor2_annotation_remainingLG_gene.gff

In [4]:
!grep "gene" gadMor2_annotation_remainingLGp2.gff >> gadMor2_annotation_remainingLGp2_gene.gff
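
Note that `grep "gene"` keeps any line containing the string "gene" anywhere, including in the attributes column. If only rows whose feature type is `gene` are wanted, filtering on the third GFF column is stricter. The sketch below compares the two on made-up records (the `maker` source and the IDs are placeholders), assuming the feature type is in column 3 as in standard GFF.

```shell
workdir=$(mktemp -d)

# Made-up GFF records: one gene, one mRNA whose Parent mentions the gene, one exon.
printf 'LG01\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\nLG01\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mRNA1;Parent=gene1\nLG01\tmaker\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mRNA1\n' > "$workdir/toy.gff"

# Substring match: counts every line that mentions "gene" anywhere.
n_grep=$(grep -c "gene" "$workdir/toy.gff")

# Column match: counts only rows whose feature type (column 3) is exactly "gene".
n_awk=$(awk -F'\t' '$3 == "gene"' "$workdir/toy.gff" | wc -l | tr -d ' ')

echo "grep: $n_grep lines, awk: $n_awk lines"
```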

#### Manual Sort of Remaining Annotation

Because of the original file format and the way that I pulled out the original annotation, I don't need to manually sort by chromosome number and start position in Excel. However, I do want to combine these two files.

In [5]:
!cat gadMor2_annotation_remainingLGp2_gene.gff >> gadMor2_annotation_remainingLG_gene.gff
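
Since the combined file is not re-sorted, a quick sanity check that it is still ordered by chromosome and then start position (column 4 in a GFF) can be done with `sort -c`. This is a sketch on made-up records; `is_sorted_gff` is a hypothetical helper, not part of bedtools.

```shell
workdir=$(mktemp -d)

# Made-up GFF records: one file in order, one with the chromosomes swapped.
printf 'LG01\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\nLG01\tmaker\tgene\t5000\t9000\t.\t-\t.\tID=g2\nLG13\tmaker\tgene\t200\t800\t.\t+\t.\tID=g3\n' > "$workdir/toy_sorted.gff"
printf 'LG13\tmaker\tgene\t200\t800\t.\t+\t.\tID=g3\nLG01\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n' > "$workdir/toy_unsorted.gff"

# Hypothetical helper: succeeds if the file is ordered by column 1, then numerically by column 4.
is_sorted_gff() {
    LC_ALL=C sort -c -k1,1 -k4,4n "$1" 2>/dev/null
}

for f in "$workdir/toy_sorted.gff" "$workdir/toy_unsorted.gff"; do
    if is_sorted_gff "$f"; then
        echo "$(basename "$f"): sorted"
    else
        echo "$(basename "$f"): NOT sorted"
    fi
done
```

To check the real file, call the helper on `gadMor2_annotation_remainingLG_gene.gff`.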

#### Run closestBed

In [6]:
cd ../analyses/alignment/

/mnt/hgfs/PCod-Compare-repo/analyses/alignment


In [7]:
cd bedtools2/bin/

/mnt/hgfs/PCod-Compare-repo/analyses/alignment/bedtools2/bin


In [10]:
!./closestBed -a ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted2.bed \
-b /mnt/hgfs/PCod-Compare-repo/ACod_reference/gadMor2_annotation_remainingLG_gene.gff \
-D a \
-k 2 \
-header \
> ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted2_Annotations_remainingLG.bed

In [11]:
!./closestBed -a ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted2.bed \
-b /mnt/hgfs/PCod-Compare-repo/ACod_reference/gadMor2_annotation_LG01.gff \
-D a \
-k 2 \
-header \
> ../../batch_8_final_filtered_gadMor2LG_filteredMQ_sorted2_Annotations_LG01.bed